  • Creating string variables

    Hi,
    I have data on employment levels by industry that I want to enter into Stata via the Data Editor. I am not sure which is better: (a) entering the industry classifications (NACE) by name as a string variable and encoding it later if needed when I actually use the dataset, or (b) numbering the NACE categories, entering the variable as a float with categories 1 to n, and attaching value labels to these (otherwise meaningless) numbers.

    Lastly, can string variables be used in analysis as they are, or do they always have to be encoded beforehand?

    Many thanks in advance

  • #2
    Numeric variables with value labels are almost always easier to enter and to check than string variables. In the latter case, small spelling or punctuation errors can lead to all sorts of small and large problems. Numeric variables with value labels can also appear in model commands as predictors (e.g. as factor variables), which string variables cannot. Some kinds of graphs won't accept string variables as defining an axis, but that is likely to bite much less.

    There is plenty of guidance on this in every introduction to Stata I know, including [U].
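
    A minimal sketch of the numeric-with-value-labels approach, assuming made-up data: the variable names (nace, employment), the label name (nace_lbl), the three categories and the employment figures are all hypothetical and only stand in for real NACE categories.

    ```stata
    * Hypothetical data: codes 1-3 stand in for NACE categories
    clear
    input byte nace employment
    1 120
    1 135
    2 340
    2 310
    3 95
    3 101
    end

    * Define the value labels once, then attach them to the variable
    label define nace_lbl 1 "Agriculture" 2 "Manufacturing" 3 "Construction"
    label values nace nace_lbl

    * Tables show the labels rather than the raw codes
    tabulate nace

    * The labelled numeric variable can enter a model as a factor variable
    regress employment i.nace
    ```

    A byte is enough here; a float would also work but uses more storage than a small set of integer codes needs.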


    • #3
      It's not often that I disagree with Nick Cox, but

      "Numeric variables with value labels are almost always easier to enter and to check than string variables. In the latter case, small spelling or punctuation errors can lead to all sorts of small and large problems."
      That's true, but there is also the problem of accurately and consistently translating each string value to the corresponding numeric value. Human data-enterers are prone to mistakes doing this, and consulting a string-to-numeric crosswalk table also slows down data entry considerably. I have generally worked by the principle that data entry should be straight transcription: any transformations that need to be made should be made by a computer, not a human.

      So my approach would be to enter the data in the exact form that you have them. Then carefully clean the classifications and names to eliminate the typos and other errors that occur. Then use -encode- to transform those to numeric variables with value labels. (I certainly agree with Nick that numeric variables with value labels are far more useful for analysis than string variables.)
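
      A minimal sketch of that enter, clean, then -encode- workflow, assuming made-up data: the variable names (industry, industry_code, employment) and the sample rows are hypothetical, and the cleaning step shown only handles stray spaces and inconsistent capitalisation. Genuine typos still have to be found and fixed by inspection.

      ```stata
      * Hypothetical raw entry with stray spaces and inconsistent case
      clear
      input str30 industry employment
      "Manufacturing"   340
      " manufacturing"  310
      "Construction "   95
      "construction"    101
      end

      * Normalise whitespace and capitalisation before encoding
      * (in Stata 13 and earlier these functions are trim(), itrim(), proper())
      replace industry = strproper(strtrim(stritrim(industry)))

      * Inspect the distinct spellings that remain and correct real typos by hand
      tabulate industry

      * Create a numeric variable with value labels taken from the strings
      encode industry, generate(industry_code)
      label list industry_code
      ```

      -encode- numbers the categories in alphabetical order of the cleaned strings, so the cleaning has to happen first; otherwise each misspelling becomes its own category.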


      • #4
        We can converge easily. My experience is that some users have more problems typing strings in correctly and consistently than one could possibly guess, and that's time-consuming too.

        But Clyde is totally correct that typing in the wrong number but one that has a value label defined is an insidious error, and I've seen that often too.

        I guess the overarching tacit advice should be spelled out too: Any data you care about (which should mean any data) should be checked at data entry stage, ideally by methods that maximise the independence of the check, e.g. a completely separate entry, if possible by a different person. Also, print out the data and compare with the original source.
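
        A minimal sketch of checking one entry against a completely separate entry, assuming two independently keyed files with hypothetical names entry_a.dta and entry_b.dta, the same variables in both, and a hypothetical identifier variable id that puts the observations in the same order.

        ```stata
        * Sort the second entry on the identifier and save a sorted copy
        use entry_b, clear
        sort id
        save entry_b_sorted, replace

        * Load the first entry, sort it the same way, and compare
        use entry_a, clear
        sort id

        * cf compares observation by observation; verbose lists each mismatch
        cf _all using entry_b_sorted, verbose
        ```

        -cf- requires the two datasets to have the same number of observations, so a dropped or duplicated row shows up as an error before any values are compared.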
        Last edited by Nick Cox; 18 Jul 2014, 09:17.


        • #5
          "I guess the overarching tacit advice should be spelled out too: Any data you care about (which should mean any data) should be checked at data entry stage, ideally by methods that maximise the independence of the check, e.g. a completely separate entry, if possible by a different person. Also, print out the data and compare with the original source."

          Couldn't agree more! Convergence achieved.


          • #6
            Many thanks to both of you for your valuable advice.
