Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with encoding a variable

    Hello everyone,

    I have a problem with data that I need to transform. I need to transform a string variable into a numeric variable. I then need to use this variable to do further calculations.

    Code:
    encode(ca_netpay),gen(ca_netpay_num) // convert string to long format
    label var ca_netpay_num "numerical ca_netpay_num"
    sort ca_netpay_num
    tab ca_netpay_num
    tab ca_netpay_num, missing nolabel // refind the numerical values behind labels
    *
    *set the variable to "missing" for all individuals where a numeric value is not available
    replace ca_netpay_num = . if ca_netpay_num >= 1614

    However, when for example I try to do the following calculation, I don't get all the values equal to 1 for the observations greater than 1000$. Why is this, could someone please help me.
    I have been struggling with something this "simple" for a while now:

    Code:
    bys ca_sex: summarize less_happy if ca_netpay_num  >1000
    Here a sample:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte ca_sex long ca_netpay_num float yes
    1  663 0
    1  726 0
    1    1 0
    1 1081 1
    1    3 0
    1  112 0
    1 1245 1
    1 1081 1
    1    . 1
    1 1081 1
    1  927 0
    1  623 0
    1  668 0
    1 1111 1
    1    . 1
    1 1081 1
    1    . 1
    1  278 0
    1 1487 1
    1 1081 1
    1  806 0
    1 1081 1
    1 1360 1
    1 1081 1
    1    . 1
    1  505 0
    1 1515 1
    1    2 0
    1    6 0
    1  853 0
    1 1487 1
    1    3 0
    1 1170 1
    1    3 0
    1    . 1
    1    . 1
    1  435 0
    1    . 1
    1  726 0
    1  107 0
    end
    label values ca_sex ca_sex
    label def ca_sex 1 "Male", modify
    label values ca_netpay_num ca_netpay_num
    label def ca_netpay_num 2 ".", modify
    label def ca_netpay_num 3 "0", modify
    label def ca_netpay_num 6 "100", modify
    label def ca_netpay_num 107 "1150", modify
    label def ca_netpay_num 112 "1156", modify
    label def ca_netpay_num 278 "1400", modify
    label def ca_netpay_num 435 "1670", modify
    label def ca_netpay_num 505 "1800", modify
    label def ca_netpay_num 623 "200", modify
    label def ca_netpay_num 663 "2075", modify
    label def ca_netpay_num 668 "2090", modify
    label def ca_netpay_num 726 "2200", modify
    label def ca_netpay_num 806 "2400", modify
    label def ca_netpay_num 853 "250", modify
    label def ca_netpay_num 927 "2750", modify
    label def ca_netpay_num 1081 "3300", modify
    label def ca_netpay_num 1111 "35000", modify
    label def ca_netpay_num 1170 "3890", modify
    label def ca_netpay_num 1245 "4500", modify
    label def ca_netpay_num 1360 "600", modify
    label def ca_netpay_num 1487 "800", modify
    label def ca_netpay_num 1515 "850", modify
    Last edited by Michael Duarte Goncalves; 23 Mar 2023, 14:36.

  • #2
    Just maybe a little clarification: the initial variable "ca_netpay" contained as string numerical values (1000, 98006, etc.), but also "words" (don't know, not applicable, etc.)

    Thank you in advance.
    Michael

    Comment


    • #3
      not enough in dataex to figure it out. include the string and the less_happy variable.

      Comment


      • #4
        Note that encode is not suitable here at all. It does not even preserve order. You should use destring instead.

        Comment


        • #5
          Hello George Ford,
          Thank you. Here it goes:

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str12 ca_netpay long ca_netpay_num byte lesshappy float yes
          "2075"          663 1 0
          "2200"          726 0 0
          " "               1 0 0
          "3300"         1081 0 1
          "0"               3 1 0
          "1156"          112 . 0
          "4500"         1245 1 1
          "3300"         1081 0 1
          "inapplicable"    . 0 1
          "3300"         1081 0 1
          "2750"          927 0 0
          "200"           623 . 0
          "2090"          668 0 0
          "35000"        1111 0 1
          "refusal"         . 0 1
          "3300"         1081 0 1
          "don't know"      . 0 1
          "1400"          278 0 0
          "800"          1487 0 1
          "3300"         1081 0 1
          "2400"          806 0 0
          "3300"         1081 0 1
          "600"          1360 0 1
          "3300"         1081 0 1
          "refusal"         . 0 1
          "1800"          505 0 0
          "850"          1515 0 1
          "."               2 0 0
          "100"             6 0 0
          "250"           853 0 0
          "800"          1487 1 1
          "0"               3 0 0
          "3890"         1170 0 1
          "0"               3 0 0
          "inapplicable"    . . 1
          "missing"         . . 1
          "1670"          435 0 0
          "inapplicable"    . 0 1
          "2200"          726 0 0
          "1150"          107 0 0
          end
          label values ca_netpay_num ca_netpay_num
          label def ca_netpay_num 2 ".", modify
          label def ca_netpay_num 3 "0", modify
          label def ca_netpay_num 6 "100", modify
          label def ca_netpay_num 107 "1150", modify
          label def ca_netpay_num 112 "1156", modify
          label def ca_netpay_num 278 "1400", modify
          label def ca_netpay_num 435 "1670", modify
          label def ca_netpay_num 505 "1800", modify
          label def ca_netpay_num 623 "200", modify
          label def ca_netpay_num 663 "2075", modify
          label def ca_netpay_num 668 "2090", modify
          label def ca_netpay_num 726 "2200", modify
          label def ca_netpay_num 806 "2400", modify
          label def ca_netpay_num 853 "250", modify
          label def ca_netpay_num 927 "2750", modify
          label def ca_netpay_num 1081 "3300", modify
          label def ca_netpay_num 1111 "35000", modify
          label def ca_netpay_num 1170 "3890", modify
          label def ca_netpay_num 1245 "4500", modify
          label def ca_netpay_num 1360 "600", modify
          label def ca_netpay_num 1487 "800", modify
          label def ca_netpay_num 1515 "850", modify
          label values lesshappy lesshappy_val_lab
          label def lesshappy_val_lab 0 "Equal or More Happier", modify
          label def lesshappy_val_lab 1 "Equal or Much Less Happy", modify
          ca_netpay is the original string variable.
          Nick Cox: you mentionned I need to use destring, but how can I do that please?

          I receive an error message:
          Code:
          destring ca_netpay, gen(new)
          ca_netpay: contains nonnumeric characters; no generate
          Last edited by Michael Duarte Goncalves; 23 Mar 2023, 14:50.

          Comment


          • #6
            try this instead of encode
            Code:
            destring ca_netpay, gen(ca_netpay_num2) force  // convert string to long format

            Comment


            • #7
              George Ford:

              Incredible, it works perfectly well! Thanks a lot.
              Best,

              Michael

              Comment


              • #8
                Applying encode when destring is needed is one of the commonest mistakes in data management and with good reason, because the distinction is moderately subtle.

                encode is most needed when strings represent categories (non-numeric) (e.g. male, female; frog, toad) but you need a numeric coding.

                destring is most needed when most if not all of the values are really numeric, but somehow have been imported as string.

                A tell-tale sign of a misguided encode is that the correspondence between numbers and value labels makes no sense until you realise that values have been encoded in string sort order, so "20" sorts after "100", "300" sorts after "1000" and so forth. label dir is your friend to check on value labels.

                Why then were they imported as string? The answers range from stupidity (someone else's, naturally) to sensible or defensible action (Stata didn't really have a choice if it was to be faithful to your data). The answers include:

                1. Import from a spreadsheet or spreadsheet-like application in which metadata -- say from the first row -- have been accidentally (or even deliberately) read in as data. You can drop that observation but the variables will remain string.

                2. Other gunk from such an application. For example, people using spreadsheets often exploit the freedom to treat it as a scrapbook, adding odd bits of text hither and thither. Good idea if you are a spreadsheet user, not so good for export.

                3. Commas as decimal separators.

                4. Currency symbols.

                5. Non-Stataish ways of indicating missing, such as "NA", "not applicable", "refused to answer", "authentic frontier gibberish", and so forth.

                6. Ranges or intervals, such as 5-6, above 50, or whatever.

                7. Dates that Stata can't understand.

                8. Whatever else I've forgotten or never knew about.

                Check out Borges's spoof on classification if not familiar https://en.wikipedia.org/wiki/Celest...lent_Knowledge

                https://www.stata-journal.com/articl...article=dm0098 is a survey of this territory written by two contributors to Statalist.
                Last edited by Nick Cox; 23 Mar 2023, 17:15.

                Comment


                • #9
                  Dear Nick Cox:

                  Thank you for the excellent explanation. All is clear relative to encode/destring.

                  Have a nice day.
                  Best,

                  Michael

                  Comment


                  • #10
                    Addition to #8:

                    Non-numeric characters such as letter O or letter l when digit 0 or digit 1 was needed.

                    Comment


                    • #11
                      Hello Nick Cox. Thank you for #10.
                      Kind regards,

                      Michael

                      Comment

                      Working...
                      X