Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Recoding text/string value to a number

    Hi all,

    I am attempting to change 30-40 text/string values to numeric values for five variables. For example, one of the variables asks "how many people live in your household?". The vast majority of respondents entered "1" or "5", but a few responded "one" or "five". In this example, I'd like to replace the response "one" with "1".

    I know that I can change these values using the replace command for individual cases, but as we have over 7000 respondents it'd be great if there's a quicker alternative. I've tried to use the recode command (e.g., recode [var] ("One"=1) and replace command (e.g., replace [var]=1 if [var]=="One") but haven't had any luck. Any help would be greatly appreciated.

    Thank you!
    Matt

  • #2
    Just directly applying -recode- to this variable is likely to get you in trouble. "One" is unlikely to be recoded as 1, and those values that already look like numbers are likely to get scrambled to different numerical values. What you have here is the worst of both worlds: a variable that has both things that look like numbers stored in strings, and non-numerical strings that need to be translated into numbers. So I think this has to be built up in steps.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 var1
    "One"  
    "1"    
    "2"    
    "3"    
    "Two"  
    "4"    
    "5"    
    "Six"  
    "7"    
    "eight"
    end
    
    gen var2 = lower(var1)    // RESOLVE ANY CASE ISSUES
    destring var2, gen(wanted) force
    replace var2 = "" if !missing(wanted)
    label define numbers    1    "one"    ///
                            2    "two"    ///
                            3    "three"    ///
                            4    "four"    ///
                            5    "five"    ///
                            6    "six"    ///
                            7    "seven"    ///
                            8    "eight"    ///
                            9    "nine"    ///
                            10    "ten"
    encode var2, gen(nvar2) label(numbers)
    replace wanted = nvar2 if missing(wanted)
    
    list var1 wanted, noobs clean
    Now, this is not a general algorithm for translating numbers written as words into numerical notation. It only works for 1 through 10. You might need to add larger numbers, but given that the variable in question is household size, I assume this won't get too far out of hand.

    In the future, when asking for code, please post example data. In this case it was easy enough to make up an example, but most of the time it is best to work with an example from the real data, just in case there are complications in the data that aren't apparent from the description.

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      My bet would be that your data originated as an import from some spreadsheet. Spreadsheets will allow text and numbers to appear in the same column of different rows. I have never seen a statistical program that would allow this. So, when Stata finds a column from a spreadsheet that has both numbers such as 1 and non-numeric text such as "one" in the same column, it adapts by presuming that everything in the column is some kind of text (string), so it makes a string variable, which can hold both numerals like "1" (not 1) and text like "one."

      If I had a quarter for everyone I had helped with this sort of problem, I'd be rich :-} The most subtle version of it is when a cell of the spreadsheet contains a "space" character, which we might think is missing, but which Stata will not. Stata thinks a space is a non-numeric character and so makes a string variable. The solution here is to never mix numbers and text in a column of a spreadsheet. f course, you likely didn't prepare this spreadsheet---some online questionnaire bot probably allowed respondents to enter non-numeric stuff in where a number was what you wanted. Something like what Clyde showed is the only fix for this.

      Comment


      • #4
        Clyde, thank you so much for the code. I imagined it would be an easy fix, so I really appreciate you taking the time to write out the code in full.

        Mike, you are right. The data were imported from Excel. It was originally a Qualtrics survey. It's a good learning process for me (and one that I've already passed on to colleagues!). I certainly won't forget to restrict the field to numerical characters in the future.

        Thanks,
        Matt

        Comment


        • #5
          For what it's worth, my experience with data sets created in Qualtrics is that they are typically funky. Whenever I sign on to a project where they have already created a Qualtrics database, I know I'm in for a data management nightmare. The front end of Qualtrics, which creates the database from a survey and accepts data input, is actually very friendly to the author and data entry person. That's why those people really like Qualtrics and it probably accounts for their strong presence in the market.

          But the datasets it produces on the back end are unwieldy in many ways and difficult for analysts to work with. Whenever I sign on to a project where that decision has not been made, or is still reversible, I lobby strenuously for alternatives. (In my opinion, of the commercially available ones, REDCap is best. It's not perfect, but it's very, very good.)

          Just sayin'.

          Comment

          Working...
          X