  • Wrong number of variables created from tabulate command

    Hi everyone,

    I have just run the code
    Code:
    tabulate kode_melder, generate(melder_)
    where kode_melder has values 1-23. An example from the dataset is
    Code:
    input str10 id_lnr byte(kode_melder melder_1 melder_2 melder_3 melder_4 melder_5 melder_6 melder_7 melder_8 melder_9 melder_10 melder_11)
    "idlnr1" 6 0 0 1 0 0 
    "idlnr2" 4 0 0 0 0 0
    "idlnr3" 11 0 0 0 0 0
    "idlnr4" 23 0 0 0 0 0
    The problem that I'm having is that Stata (version 16.1) is only creating indicator variables up to number 22, even though kode_melder has values up to 23. So, I end up with melder_1 - melder_22, but don't understand why melder_23 is not created. Does anyone know how to fix this?

    Thank you!

  • #2
    Stata is willing to create 23 indicator variables this way:

    Code:
    . clear
    
    . set obs 23
    Number of observations (_N) was 0, now 23.
    
    . gen category = _n
    
    . tab category, gen(indcat)
    
       category |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          1        4.35        4.35
              2 |          1        4.35        8.70
              3 |          1        4.35       13.04
              4 |          1        4.35       17.39
              5 |          1        4.35       21.74
              6 |          1        4.35       26.09
              7 |          1        4.35       30.43
              8 |          1        4.35       34.78
              9 |          1        4.35       39.13
             10 |          1        4.35       43.48
             11 |          1        4.35       47.83
             12 |          1        4.35       52.17
             13 |          1        4.35       56.52
             14 |          1        4.35       60.87
             15 |          1        4.35       65.22
             16 |          1        4.35       69.57
             17 |          1        4.35       73.91
             18 |          1        4.35       78.26
             19 |          1        4.35       82.61
             20 |          1        4.35       86.96
             21 |          1        4.35       91.30
             22 |          1        4.35       95.65
             23 |          1        4.35      100.00
    ------------+-----------------------------------
          Total |         23      100.00
    
    .
    . ds indcat*
    indcat1   indcat6   indcat11  indcat16  indcat21
    indcat2   indcat7   indcat12  indcat17  indcat22
    indcat3   indcat8   indcat13  indcat18  indcat23
    indcat4   indcat9   indcat14  indcat19
    indcat5   indcat10  indcat15  indcat20

    help limits suggests to me that if an upper limit applies here, it comes from somewhere else.

    I ran this in Stata 18 and also in 16 with equivalent results.

    I have to guess that the problem lies in your data. You've shown us that 23 is a value in the data, so I don't have a quick explanation.

    Please show us the results of

    Code:
    contract kode_melder
    dataex
    (save your dataset first if changed)


    • #3
      Here are the results of
      Code:
      contract kode_melder
      dataex
      Code:
      input byte kode_melder int _freq
       1  84
       2 264
       3 132
       4 122
       5 532
       6  85
       7 102
       8 509
       9  17
      10 258
      11 570
      12  16
      13  75
      14  63
      15 186
      16  33
      17  26
      18  11
      20   3
      21   4
      22  88
      23  201
      Looking at this output, the problem probably lies in the fact that kode_melder == 19 has a frequency of 0, so Stata skips over that value. I am planning to merge this dataset with other datasets that likely have values for kode_melder == 19. What would be the best way to go about this so that the melder* variables keep the correct numbers?
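
      A minimal sketch of why the numbering shifts, using toy data that is not from the thread (the variable name kode and the stub ind_ are made up for illustration). tabulate with generate() numbers the indicators by the position of each distinct observed value, not by the value itself:

      ```stata
      * Toy data (hypothetical): values 1, 2 and 4 are observed, 3 is not
      clear
      input byte kode
      1
      2
      4
      end

      * Creates ind_1, ind_2 and ind_3 only: one per distinct observed value
      tabulate kode, generate(ind_)

      * ind_3 flags kode == 4, because 4 is the third distinct value
      list kode ind_*
      ```

      Applied to the frequencies above, which contain 22 distinct values (1-18 and 20-23), the same logic would mean that melder_19 marks kode_melder == 20 and melder_22 marks kode_melder == 23.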


      • #4
        Do the -merge- first and then run your -tab, gen()- command. You don't say why you want these indicator variables, but if you want to include them in some kind of model, you may be better off using factor variable notation rather than generating all these variables.
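
        As a rough illustration of factor variable notation, here is a sketch that uses Stata's shipped auto dataset rather than the thread's data. The i. prefix tells the estimation command to treat the variable as categorical and build the indicators internally, so none need to be generated:

        ```stata
        * Shipped example data; rep78 is a categorical variable with values 1-5
        sysuse auto, clear

        * i.rep78 expands into indicators on the fly; no new variables are created
        regress price i.rep78

        * With the thread's data the analogous call might look like the line below,
        * where y is a hypothetical outcome variable not mentioned in the thread:
        * regress y i.kode_melder
        ```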


        • #5
          Sorry I did not specify. I am creating these indicator variables because one of the variables that I need for merging, "id_lnr", currently has duplicates in the dataset. The duplicates arise because each id_lnr can have several codes for the kode_melder variable, so all information in the duplicate rows is the same except for kode_melder. I am not able to merge with the other datasets until each id_lnr appears in a single observation, so I was trying to find a way to keep the information from all of the codes by creating indicator variables. My current process is to go through each dataset, clean it so that I end up with unique id_lnr values, save it, then merge. I was previously doing that in a very inefficient way, with tons of copy-paste code, so I was trying to streamline the process a bit by learning how to reduce the amount of code.
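
          One possible way to do this, sketched on toy data with made-up values: build one indicator per code, then collapse to one row per id_lnr. The fixed range 1/23 is an assumption taken from the statement that kode_melder runs from 1 to 23, and using collapse (max) is a suggestion here, not something proposed in the thread.

          ```stata
          * Toy data (hypothetical values): idlnr1 appears twice with different codes
          clear
          input str10 id_lnr byte kode_melder
          "idlnr1" 6
          "idlnr1" 3
          "idlnr2" 4
          end

          * One indicator per code, named after the code's value
          forvalues i = 1/23 {
              generate byte melder_`i' = (kode_melder == `i')
          }

          * Keep one row per id_lnr; (max) sets an indicator to 1
          * if any of that id's rows carried the code
          collapse (max) melder_*, by(id_lnr)

          * Stops with an error if id_lnr is still not unique
          isid id_lnr
          list id_lnr melder_3 melder_4 melder_6
          ```

          Because the loop range is fixed, the indicator names stay aligned across datasets even when a code such as 19 is absent from one of them.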


          • #6
            I cannot follow what the purpose is for these indicators, but as long as the minimum value and maximum value are always observed, here is another way to create the indicators that includes the empty categories.

            Code:
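            * r(min) and r(max) hold the smallest and largest observed codes
            qui sum kode_melder
            * one indicator per integer in that range, including unobserved values
            forval i= `r(min)'/`r(max)'{
                gen melder`i'= `i'.kode_melder
            }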


            • #7
              That was perfect Andrew, thank you!


              • #8
                I agree with Rich Goldstein that the first preference is to use factor variable notation. Mary will know, but Rich won't, that this was a point made in comments at https://stackoverflow.com/questions/...tata-correctly

                See also dummieslab from SSC. This is a rather old command (most of the work done 2003/2004) but I suspect much of the original motivation was to have names for the indicators that made sense.
