  • String variable in regression

    I have a variable called Q9 which has either identical string values or empty cells. I want to generate a new variable called employment which takes a value of 1 if Q9 has a string value and 0 if Q9 is blank. I used the following code:

    Code:
    encode Q9, gen(employment)
    probit dep_var employment indep_var1 indep_var2
    And got the following results:

    Code:
    Probit regression                               Number of obs     =        318
                                                    LR chi2(2)        =       6.91
                                                    Prob > chi2       =     0.0315
    Log likelihood = -93.626457                     Pseudo R2         =     0.0356
    
    ------------------------------------------------------------------------------
         dep_var |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
      indep_var1 |   .6149843   .2511377     2.45   0.014     .1227634    1.107205
      indep_var2 |   .1176468   .3881169     0.30   0.762    -.6430484     .878342
      employment |          0  (omitted)
           _cons |    .435061    .529323     0.82   0.411     -.602393    1.472515
    ------------------------------------------------------------------------------
    
    .
    end of do-file
    By the way, empty cells in a string variable indicate missing values, right? I ran the following code to check, but it did not show any missing values!

    Code:
    di missing(Q9)
    0
    Last edited by Taz Raihan; 04 Sep 2018, 13:42.

  • #2

    The -encode- command does not generate a zero corresponding to the missing values; it generates a missing value. So your variable employment will have a value of 1 in some observations and be missing in all others. -probit- must omit from the analysis any observation with a missing value in any of the variables mentioned in the command, so within the estimation sample that -probit- works with, your "variable" employment is just a constant, 1. That's why it gets omitted. You need an actual 0/1 variable. The simplest way to get one is to forget about -encode- here and run:

    Code:
    gen byte employment = !missing(Q9)
    This version of the employment variable should work nicely with -probit-.

    Your -di missing(Q9)- command does not do what you think it does. Remember that Q9 is a variable that takes on potentially different values in every observation. But -display- only displays one number. When -display- is confronted with an expression involving a variable, it defaults to the first value of that variable. So Stata looks at the value of Q9 in the first observation and then determines whether it is a missing value. If it were, it would respond to your -di- command with 1. As it turns out, the first value of Q9 is not missing, so Stata responds with 0.
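
    A small sketch of that point, assuming Q9 is in memory: the two commands below display the same number, because the first is evaluated using only observation 1.
    Code:
    * with a variable in the expression, -display- uses the first observation
    display missing(Q9)
    * the same thing written with an explicit subscript
    display missing(Q9[1])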

    If you want to count the number of observations with missing values of Q9, that would be:
    Code:
    count if missing(Q9)
    If you want to -browse- all the observations which have missing values of Q9, that would be:
    Code:
    browse if missing(Q9)
    etc.


    • #3
      Clyde Schechter thank you, sir. I tried your code, but it generated a variable called "employment" that equals 1 in every observation. Is it possible for cells in Stata to appear empty when they actually are not, and so produce this kind of result?


      • #4
        Yes. What you see as an empty cell might have blank spaces or non-printing characters. To get rid of blank spaces:

        Code:
        replace Q9 = trim(itrim(Q9))
        replace Q9 = "" if Q9 == " "
        If Q9 contains non-printing characters, then you need to find those and remove them one at a time by looping over their values.
        Code:
        charlist Q9
        return list
        The ascii codes of all of the characters in any observation in Q9 will be shown in r(ascii). Disregard those that correspond to normal printing characters and then loop over the remaining ones using the -subinstr()- function to replace all occurrences of each of those characters with a null string (""). That will clean up Q9.
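
        A minimal sketch of such a loop, assuming (hypothetically) that r(ascii) turned up codes 9 (tab) and 13 (carriage return) as the unwanted characters; substitute whatever codes -charlist- reports for your data. Note that -charlist- is a user-written command, which I believe can be installed with -ssc install charlist-.
        Code:
        * hypothetical codes: replace 9 and 13 with the ones found in r(ascii)
        foreach c in 9 13 {
            * remove every occurrence of this character from Q9
            replace Q9 = subinstr(Q9, char(`c'), "", .)
        }
        * values of Q9 that are now empty strings count as missing
        count if missing(Q9)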


        • #5
          I wonder whether you wish something like this:

          Code:
          . input str3 var1
          
                    var1
            1. 1
            2. ""
            3. B
            4. C
            5. 1
            6. ""
            7. end
          
          . list
          
               +------+
               | var1 |
               |------|
            1. |    1 |
            2. |      |
            3. |    B |
            4. |    C |
            5. |    1 |
               |------|
            6. |      |
               +------+
          
          . encode var1, gen(Q9)
          
          . list
          
               +-----------+
               | var1   Q9 |
               |-----------|
            1. |    1    1 |
            2. |         . |
            3. |    B    B |
            4. |    C    C |
            5. |    1    1 |
               |-----------|
            6. |         . |
               +-----------+
          
          . replace Q9 = 0 if missing(Q9)
          (2 real changes made)
           
          . replace Q9 = 1 if Q9 !=0
          (2 real changes made)
          
          . list
          
               +-----------+
               | var1   Q9 |
               |-----------|
            1. |    1    1 |
            2. |         0 |
            3. |    B    1 |
            4. |    C    1 |
            5. |    1    1 |
               |-----------|
            6. |         0 |
               +-----------+
          Best regards,

          Marcos


          • #6
            I think given the possible complications of invisible characters, Marcos' solution is simpler than mine and you should go with that.


            • #7
              I followed Marcos' solution. But issuing the following command generated a variable called employment which again has blank cells for the observations that had blank cells in my original string variable Q9. Unlike in Marcos' example, the newly created variable employment does not show "." for the blank cells.

              Code:
               
               encode Q9, gen(employment)
              Code:
              encode Q9_d, gen(employment)
              
              . 
              . 
              . list employment
              
                   +----------------------------------------------------+
                   |                                         employment |
                   |----------------------------------------------------|
                1. |                                                    |
                2. |                                                    |
                3. |                                                    |
                4. |                                                    |
                5. |                                                    |
                   |----------------------------------------------------|
                6. |                                                    |
                7. |                                                    |
                8. |                                                    |
                9. |                                                    |
               10. |                                                    |
                   |----------------------------------------------------|
               11. | d)       Increased employment (direct or indirect) |
               12. | d)       Increased employment (direct or indirect) |
               13. | d)       Increased employment (direct or indirect) |
               14. |                                                    |
               15. |                                                    |
                   |----------------------------------------------------|


              • #8
                Use the -nolabel- option with list:
                Code:
                list employment, nolabel
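                For context, a sketch of why this helps, assuming employment was created by -encode- with its default value label (which takes the new variable's name): -list- shows the labels rather than the underlying integer codes, so a code whose label is a single blank looks like an empty cell.
                Code:
                * show the mapping from integer codes to the labels created by -encode-
                label list employment
                * tabulate the underlying codes instead of their labels
                tabulate employment, nolabel missing
                If the blank-looking rows show up with their own integer code rather than ".", the original strings were not truly empty.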
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1


                • #9
                  I kindly recommend following this FAQ advice:

                  12.1 What to say about your commands and your problem

                  Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!
                  If you are getting different results, this may be due to a particularity of the data set or the commands you used. The best next step is to share an excerpt of the data (you may use CODE delimiters or -dataex- for that).
                  Best regards,

                  Marcos


                  • #10
                    Here is the data excerpt.

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str44 Q9
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    "d)     Increased employment (direct or indirect)"
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    " "                                          
                    "d)     Increased employment (direct or indirect)"
                    "d)     Increased employment (direct or indirect)"
                    " "                                          
                    end
                    Then I used the following command:

                    Code:
                    encode Q9, gen(employment)
                    
                    list employment
                         +----------------------------------------------------+
                         |                                         employment |
                         |----------------------------------------------------|
                      1. |                                                    |
                      2. |                                                    |
                      3. |                                                    |
                      4. |                                                    |
                      5. |                                                    |
                         |----------------------------------------------------|
                      6. |                                                    |
                      7. |                                                    |
                      8. |                                                    |
                      9. |                                                    |
                     10. |                                                    |
                         |----------------------------------------------------|
                     11. | d)       Increased employment (direct or indirect) |
                     12. | d)       Increased employment (direct or indirect) |
                     13. | d)       Increased employment (direct or indirect) |
                     14. |                                                    |
                     15. |                                                    |
                         |----------------------------------------------------|


                    • #11
                      Keep in mind that your data actually contain " ", not "": there is a blank space between the quotes. If you run my previous commands after removing those blank spaces, I believe they will work fine.
                      Best regards,

                      Marcos


                      • #12
                        Or, if the full data are as you show in your -dataex- example, then you can just do:
                        Code:
                        gen byte employment = Q9 != " "
                        and you will get a 0/1 variable for employment.

                        It is important to understand that, for string variables, "" is a missing value, but " " is not a missing value. " " is a one-character string that contains a blank space. In the Browser or in listings you cannot see the difference between "" and " ", but the difference is very real, and Stata commands are sensitive to it.
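
                        A quick way to see the difference for yourself (a minimal sketch; the literals are typed directly rather than taken from your data):
                        Code:
                        * the empty string is the string missing value, so this displays 1
                        display missing("")
                        * a single blank is an ordinary one-character string, so this displays 0
                        display missing(" ")
                        * its length is 1, not 0
                        display strlen(" ")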

                        Whoever created your source data file engaged in a deprecated practice by using " " to denote no increase in employment (or, for that matter, by using " " as a value for anything at all, except where text from other sources is being copied in). It sets a trap for the unwary user and, even if the wary user avoids the trap, it creates unnecessary work.
                        Last edited by Clyde Schechter; 05 Sep 2018, 11:54.


                        • #13
                          Clyde gave a clever approach.

                          If you want to stick to the command shared in #5, you just need to type beforehand:

                          Code:
                          replace Q9 = "" if Q9 ==" "
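                          Put together with the commands from #5, the whole sequence might look like this sketch, assuming every blank-looking value of Q9 is exactly one space, as in the -dataex- excerpt:
                          Code:
                          * turn the one-space strings into true string missing values
                          replace Q9 = "" if Q9 == " "
                          * -encode- now yields numeric missing (.) for those observations
                          encode Q9, gen(employment)
                          * recode to a 0/1 indicator as in #5
                          replace employment = 0 if missing(employment)
                          replace employment = 1 if employment != 0
                          If employment already exists from an earlier attempt, -drop employment- first.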
                          Best regards,

                          Marcos


                          • #14
                            Thank you so much for all the valuable inputs.
