  • by() processing with string vs. numeric variable

    I am using the user-written egen_inequal package which calculates inequality statistics such as the Gini. I am calculating the Gini for by-groups. If the by-groups are defined using a numeric variable, I can get the Gini. But if the by-groups are defined using a string variable, all I get are missing values.

    Some questions:
    1. Is it common for the by() option only to work with numeric variables?
    2. If my groups are defined by a string variable, what can I do to get the results I want?

    Here is some example code.
    --------------------------------------
    ssc install egen_inequal
    sysuse auto, clear

    /* foreign is a numeric variable.
    Let's create a string variable Foreign representing the same information. */
    gen Foreign="Foreign" if foreign==1
    replace Foreign="Domestic" if foreign==0

    egen gini=gini(price), by(foreign)
    /* This works fine, because foreign is numeric. */

    drop gini
    egen gini=gini(price), by(Foreign)
    /* This generates missing values, because Foreign is a string. */

  • #2
    And while we're talking about by() groups, I have another question. We have written a command that can use the by() option. However, it can only take one by variable. Is this a common problem? Is there an easy way to make by() take multiple variables?


    • #3
      1. Is it common for the by() option only to work with numeric variables?
      It is not common, but it is not unheard of.

      2. If my groups are defined by a string variable, what can I do to get the results I want?
      You can -encode- the string variable and use the -encode-d version, or, if there are too many values for that to be sensible, run -egen numeric_var = group(string_var)-, and then use numeric_var in the -by()- option. Both -encode- and -egen, group()- are fundamental to data management in Stata. Your time spent reading the corresponding sections in the [D] Data Management Reference Manual will be amply repaid.
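
      A minimal sketch of both routes, reusing the Foreign variable built in #1. The names Foreign_num and Foreign_grp are made up for illustration, and the last line assumes egen_inequal is already installed:

      ```stata
      sysuse auto, clear

      * Rebuild the string variable from the original post
      gen Foreign = "Foreign" if foreign == 1
      replace Foreign = "Domestic" if foreign == 0

      * Route 1: encode creates a value-labelled numeric copy
      encode Foreign, generate(Foreign_num)

      * Route 2: egen group() numbers the distinct values 1, 2, ...
      egen Foreign_grp = group(Foreign)

      * Either numeric variable can then go into by()
      egen gini = gini(price), by(Foreign_num)
      ```

      Either variable identifies the same groups; -encode- has the advantage that the original strings survive as value labels.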

      We have written a command that can use the by() option. However, it can only take one by variable. Is this a common problem? Is there an easy way to make by() take multiple variables?
      The built-in -by:- prefix in Stata accommodates string and numeric variables, single variables and multiple variables, all with equal ease. However, the -by()- option of a program will work exactly the way the author of the program specifies in the code. Since you are the programmer here, you, yourself, set it up to handle only one variable. There may have been good reasons to do that: perhaps it wouldn't make sense with multiple variables. (For example, the -by()- option in -ttest- would be nonsensical with multiple variables.) But if it does make sense in your context, then only you would know how to revise your code so as to relax that restriction. In any case, if a user of the program wanted the -by()- to effectively be defined by a list of variables, that can be accomplished easily by creating a new single variable that defines the groups by running -egen new_group_variable = group(list_of_variables_that_define_group)-.
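
      As a sketch of that last point, the following collapses two variables from the auto data into a single group identifier. The names fr_group and mean_price are hypothetical, and the built-in -egen, mean()- merely stands in for a command whose -by()- accepts only one variable:

      ```stata
      sysuse auto, clear

      * One id per distinct (foreign, rep78) combination. Observations
      * with a missing value in either variable get a missing id unless
      * the missing option is added.
      egen fr_group = group(foreign rep78), label

      * Inspect the groups that were formed
      tabulate fr_group

      * The single variable can now be passed to a one-variable by()
      egen mean_price = mean(price), by(fr_group)
      ```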


      • #4
        With respect to the egen_inequal package, this is a bug in the program. Since this is a pretty old package (2006), I'm not sure the author is still maintaining it but I would try to bring this to his attention. The bug can easily be fixed by modifying the following line of "_ginequal.ado":

        Code:
        markout `touse' `badinc' `by';
        to
        Code:
        markout `touse' `by', strok;
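
        To see why strok matters, here is a small illustration outside the package, based on the documented behavior of -markout-: without strok, a string variable in the list sets the marker to 0 in every observation, while with strok only observations holding an empty string are marked out. The variable names touse1 and touse2 are made up for this sketch:

        ```stata
        sysuse auto, clear
        gen Foreign = cond(foreign == 1, "Foreign", "Domestic")

        * Without strok: the string variable zeroes the marker everywhere
        mark touse1
        markout touse1 Foreign

        * With strok: only observations where Foreign == "" are dropped
        mark touse2
        markout touse2 Foreign, strok

        * Compare how many observations remain marked in each case
        count if touse1
        count if touse2
        ```

        An all-zero marker is consistent with the missing Gini values reported in #1.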
        With respect to your other questions, in Stata, observation groups are defined using a varlist and this varlist can include variables of mixed type, both numeric and string. See help byable for more information on how to make a program byable.
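
        A minimal sketch of a byable program, assuming the byable(recall) style, in which Stata calls the program once per by-group and -marksample- restricts the sample to the current group. The program name mymean is hypothetical:

        ```stata
        capture program drop mymean
        program define mymean, byable(recall)
            version 12
            syntax varname(numeric) [if] [in]
            * In a byable(recall) program, marksample also limits
            * `touse' to the by-group currently being processed
            marksample touse
            quietly summarize `varlist' if `touse', meanonly
            display as text "Mean of `varlist': " as result %9.2f r(mean)
        end

        sysuse auto, clear
        gen Foreign = cond(foreign == 1, "Foreign", "Domestic")

        * Mixed string and numeric by-variables need no extra code
        bysort Foreign rep78: mymean price
        ```

        Because the -by:- prefix does the grouping, the program itself never needs to know how many by-variables there are or what type they have.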


        • #5
          Robert Picard : Thank you. That would fix the problem for the Gini, I guess. Would similar fixes need to be made to other egen_inequal files that calculate other inequality statistics?


          • #6
            Yes, if they show the same problem. But Clyde in #3 already indicated a much easier workaround:

            Code:
             
            egen numeric_var = group(string_var), label
            I have added advice to use the label option.


            • #7
              Hello all,

              I haven't been active on Statalist for so long that I only registered on this new forum today :$

              paulvonhippel : The fix that Robert suggested does resolve the issue, and all measures will start accepting string groups, because all of the work is done in a single file, _ginequal.ado. As suggested, creating a group variable or converting your strings to numeric will give you a quick workaround (or you can of course change your local copy of _ginequal.ado), but I'll try to submit the update, so hopefully you'll see it soon on SSC.

              Best,
              Zurab


              • #8
                Zurab Sajaia : Thanks! I'll look for it....
