  • How to make this correspondence dta?

    Original data has a string variable "country"

    country
    US
    US
    UK
    US
    Canada
    UK
    US
    Canada

    To reduce file size, I want to turn this into a numeric variable, something like
    country
    1
    1
    2
    1
    3
    2
    1
    3
    while also separately saving a file correspondence.dta that looks like
    numeric country
    1 US
    2 UK
    3 Canada

    Converting it to numeric to reduce the file size can be done like this:
    encode country, gen(country2)
    drop country
    gen country3=country2
    drop country2
    rename country3 country

    But because I keep country3 rather than country2 (which carries the string labels), the data no longer contain the string information that I need, although the file size has been successfully reduced.
    I need that correspondence.dta file.
    How can I generate that while also simultaneously reducing file size like this?

  • #2
    Welcome to Statalist. You may want to consider using a value label rather than a "look-up" dataset (or translation table). Your description does not make clear why you need such a table, and you probably don't, but if you do, it can be produced from the label.

    First off, Stata allows up to 65,536 unique label value levels, which can have a maximum length of 80 characters. If your data are within these limits (such as country names in your toy data example), then labels are the way to go.

    Code:
    clear
    input str6(country_name)
    US
    US
    UK
    US
    Canada
    UK
    US
    Canada
    end
    list
    
    encode country_name, gen(country)
    compress country
    label list country
    This creates a new variable called -country- taking integer values from 1 to 3, which have the following label attached.

    Code:
    . label list country
    country:
               1 Canada
               2 UK
               3 US
    You could also supply your own label if you cared about which values get mapped to which country.
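
    For instance, a minimal sketch of supplying your own mapping, assuming you want US = 1, UK = 2 and Canada = 3 as in the original question (the toy data are re-entered here only so the snippet runs on its own): define the value label first, then point -encode- at it with the label() option.

    ```stata
    clear
    input str6(country_name)
    US
    UK
    Canada
    US
    end

    * Define the mapping yourself instead of letting -encode- sort alphabetically
    label define country 1 "US" 2 "UK" 3 "Canada"

    * label() tells -encode- to use the existing value label named country
    encode country_name, gen(country) label(country)
    compress country
    label list country
    ```

    -encode- uses the codes already defined in the named label and only adds new ones for strings the label does not cover.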

    From here you can generate a dataset for the label mapping of -country- which can serve as your translation table.

    Code:
    uselabel country, clear
    list
    Result

    Code:
         +----------------------------------+
         |   lname   value    label   trunc |
         |----------------------------------|
      1. | country       1   Canada       0 |
      2. | country       2       UK       0 |
      3. | country       3       US       0 |
         +----------------------------------+
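
    To save that listing as the correspondence.dta asked for in #1, something along these lines should work. This is only a sketch: main_data is a made-up file name, the variable names numeric and country follow the layout requested in #1, and the toy data are re-entered so the snippet runs on its own. Because -uselabel, clear- replaces the data in memory, the recoded main data are saved first.

    ```stata
    clear
    input str6(country_name)
    US
    UK
    Canada
    US
    end

    * Recode to a labelled numeric variable and save the smaller main file
    encode country_name, gen(country)
    drop country_name
    compress
    save main_data, replace          // hypothetical file name

    * Turn the value label into a dataset and keep only the mapping
    uselabel country, clear
    keep value label
    rename (value label) (numeric country)
    save correspondence, replace
    list
    ```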
    If your data fall outside of the limits of what a label could hold, then the most direct way may be to use -egen, tag()-.

    Code:
    egen byte unique = tag(country_name)
    bysort unique (country_name) : gen country = sum(unique)
    drop unique
    keep if country
    qui compress
    list, abbrev(12)
    Result

    Code:
         +------------------------+
         | country_name   country |
         |------------------------|
      1. |       Canada         1 |
      2. |           UK         2 |
      3. |           US         3 |
         +------------------------+
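
    Note that the code above leaves only the look-up table in memory. If you also want to keep the recoded main data, one alternative is sketched below. It swaps in -egen, group()- for -egen, tag()-, and main_data is a made-up file name; the toy data are re-entered so the snippet runs on its own.

    ```stata
    clear
    input str6(country_name)
    US
    UK
    Canada
    US
    end

    * group() numbers the distinct values 1, 2, 3, ... in sorted order
    egen long country = group(country_name)

    * Save one row per country as the look-up table, then return to the full data
    preserve
    keep country country_name
    duplicates drop
    sort country
    save correspondence, replace
    restore

    * Drop the string variable from the main data and save the smaller file
    drop country_name
    compress
    save main_data, replace          // hypothetical file name
    ```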


    • #3
      Thanks for your reply. How can I save the table below to another .dta file?
      [Image: 2023-06-13 015754.png]


      • #4
        Read the next sentence....


        • #5
          First off, Stata allows up to 65,536 unique label value levels, which can have a maximum length of 80 characters.
          label define creates a value label named lblname, which is a set of individual numeric values
          and their corresponding labels. lblname can contain up to 65,536 individual labels; each individual
          label can be up to 32,000 characters long.


          • #6
            Originally posted by Bjarte Aagnes

            label define creates a value label named lblname, which is a set of individual numeric values
            and their corresponding labels. lblname can contain up to 65,536 individual labels; each individual
            label can be up to 32,000 characters long.
            Thanks for the correction, Bjarte. I misread the label length. Indeed 32,000 characters expands the utility of labels even more.


            • #7
              Indeed 32,000 characters expands the utility of labels even more.
              Well, maybe. Doesn't the argument against Stata allowing variable names to be longer than 32 characters, based on the difficulty/impossibility of displaying them in output, apply equally well to value labels? And that argument has been raised against requests to go to 64 characters. It would be much more problematic with 32,000. Is there some use of value labels that I'm just not aware of which makes extremely long value labels more useful?


              • #8
                Originally posted by Clyde Schechter
                Well, maybe. Doesn't the argument against Stata allowing variable names to be longer than 32 characters, based on the difficulty/impossibility of displaying them in output, apply equally well to value labels? And that argument has been raised against requests to go to 64 characters. It would be much more problematic with 32,000. Is there some use of value labels that I'm just not aware of which makes extremely long value labels more useful?
                I agree with you, Clyde, and my remarks in #6 were restricted to what I believe to be the use case in this thread. Specifically, I understood the use case to be (1) create a dictionary/look-up dataset that (2) uses minimal disk space. I don't believe that long labels or variable names are terribly useful for any display purposes, in general. Here, OP does not say that they want to display this information, only to store it. I thought to use labels as a concise way of doing this that wouldn't later require some use of -merge- (or the like) to recover the encoded data.
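
                As a small illustration of that last point, here is a sketch on re-entered toy data (country_name is an assumed name for the recovered variable). -decode- rebuilds the string variable from the labelled numeric one without any -merge-.

                ```stata
                clear
                input str6(country_name)
                US
                UK
                Canada
                US
                end

                * Store only the labelled numeric variable
                encode country_name, gen(country)
                drop country_name

                * Recover the strings from the value label when they are needed
                decode country, gen(country_name)
                list
                ```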


                • #9
                  Cross-posted at https://stackoverflow.com/questions/...-following-dta

                  Please note our longstanding request to tell people about cross-posting elsewhere.
