  • Converting a Continuous Variable to an Ordinal Variable (For Large Numbers)

    Hi, I have a continuous variable (median household income) from my census data that I am trying to convert into an ordinal variable (i.e., median household income groups). I attempted to split the variable (FSA_medincome_household_743) into 12 groups in increments of $10,000, using this code:

    egen FSA_medincome_household_743ord = cut(FSA_medincome_household_743), at(10000(10000)120000) label


    The problem here is that all my observations get coded as missing. I am not sure whether it has something to do with the large values of the variable or with its being stored as type long.


    Can anyone help me figure out what I am doing wrong?

    Thanks in advance!

  • #2
    The following example based on the single command you have shown us works as expected.
    Code:
    . describe medinc
    
                  storage   display    value
    variable name   type    format     label      variable label
    ------------------------------------------------------------------------------------------------
    medinc          long    %12.0g                
    
    . egen incord = cut(medinc), at(10000(10000)120000) label
    (2 missing values generated)
    
    . generate incordnl = incord
    (2 missing values generated)
    
    . list, clean
    
           medinc    incord   incordnl  
      1.     5000         .          .  
      2.    15000    10000-          0  
      3.    25000    20000-          1  
      4.    35000    30000-          2  
      5.    45000    40000-          3  
      6.    55000    50000-          4  
      7.    65000    60000-          5  
      8.    75000    70000-          6  
      9.    85000    80000-          7  
     10.    95000    90000-          8  
     11.   105000   100000-          9  
     12.   115000   110000-         10  
     13.   125000         .          .  
    
    .
    But perhaps the problem is in your data. Let me guess one very common cause. Your Census data came from a text file or an Excel worksheet or something similar, and for some reason the values were imported as a string variable. And then you used encode to convert the string variable to a numeric variable.

    Was that a good guess?

    If so, that's the source of your problem. The encode command is designed for assigning numerical codes to non-numeric strings like "France", "Germany", "United States". The output of help encode instructs us

    Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring; see real() or [D] destring.
    So if you were using encode where you should have used destring, you need to go back to your original data and correctly convert the strings to numbers.
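
    A minimal sketch of the difference, assuming that guess is right. The toy data and the variable names medinc_str, medinc_bad, and medinc are hypothetical, not taken from your dataset:

    ```stata
    * Hypothetical toy data: income values that arrived as strings
    clear
    input str8 medinc_str
    "15000"
    "25000"
    "105000"
    end

    * encode assigns integer codes 1, 2, 3, ... in alphabetical order of the strings
    encode medinc_str, generate(medinc_bad)

    * destring converts the digits to the numbers they represent
    destring medinc_str, generate(medinc)

    * nolabel shows the underlying codes rather than the value labels
    list medinc_str medinc_bad medinc, nolabel clean
    ```

    With nolabel, medinc_bad holds small integer codes rather than incomes. Since every code is far below 10000, egen's cut() with at(10000(10000)120000) would return missing for every observation, which matches the symptom you describe.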

    If that wasn't a good guess, then your problem really isn't clear without more detail, or at a minimum it is too difficult to guess at a good answer from what you have shared. Please help us help you. Show example data. The Statalist FAQ provides advice on effectively posing your questions, posting data, and sharing Stata output. In particular, the dataex command is a helpful way to provide sample data, as described in section 12 of the FAQ.
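
    As a rough sketch of what that could look like for the variable in your post (the choice of the first 20 observations is arbitrary, and in older versions of Stata dataex may first need to be installed from SSC):

    ```stata
    * Show the storage type and any value label attached to the variable
    describe FSA_medincome_household_743

    * Produce a copy-and-paste data example from the first 20 observations
    dataex FSA_medincome_household_743 in 1/20
    ```

    The output of both commands can be pasted into a reply between code delimiters.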

    • #3
      William Lisowski gives excellent advice and here's some more. As documented at (e.g.) https://www.stata-journal.com/articl...article=dm0095 and as evident also from the function definitions


      Code:
      gen bin = 10000 * floor(whatever / 10000)
      and

      Code:
      gen bin = 10000 * ceil(whatever / 10000)
      are direct ways to get bins of width 10000 with lower or upper bin limits as given by floor() or ceil() respectively.

      This method has various direct advantages, including

      1. Missing values are mapped to missing, as should usually be desired.

      2. The bins are self-documenting, which is good for graphs and tables. That doesn't rule out fancier value labels if desired so long as the limits are integers.

      3. There is less or even no need to fuss about what the overall range is, once you have decided on a bin width.

      4. The floor and ceiling definitions make it evident -- including to non-Stata users who might read your code -- what happens at bin limits.

      Evidently this applies "whatever value of 10000 you use", to adapt a comment attributed to William Feller.
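
      A small sketch of both definitions side by side, on hypothetical toy values chosen to include a missing value and a value sitting exactly on a bin limit. The names medinc, bin_lower, and bin_upper are illustrative only:

      ```stata
      * Hypothetical toy data, including a missing value and an exact multiple of 10000
      clear
      input long medinc
      5000
      15000
      119999
      120000
      .
      end

      * Lower bin limit: largest multiple of 10000 that is <= medinc
      generate long bin_lower = 10000 * floor(medinc / 10000)

      * Upper bin limit: smallest multiple of 10000 that is >= medinc
      generate long bin_upper = 10000 * ceil(medinc / 10000)

      list, clean
      ```

      Here 5000 falls in the bin with lower limit 0 and upper limit 10000, the missing value stays missing in both new variables, and 120000 maps to 120000 under both definitions because it lies exactly on a limit.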

