Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Convert multiple variables from str to int respecting regularities (with more than 18million of unique values needed)

    Dear all,
    I have a dataset of two string variables: Input_address and Output_address. I need to convert them from string to int to save memory (about 100million rows).
    The point is that some Input_address and output_address are the same, and I want them to remain the same also after encode the variables.
    I mean, if Input_address (first column) is "hanb23bd4.."*, and in Output_address I have the same "hanb23bd4.."*, I want them to be converted into the same number once run encode.
    I don't know how to do that.

    *fictitious example

    This was my previous post, and I received a good suggestion of using "multencode".
    The problem is that I quickly reach the limit of unique values of encode (65 thousands more ore less, and I need approximately 18million of unique values).
    How can solve it? I need something that works exactly in the same way of multencode, but with much more unique values.
    I also have the need to export the resulting dataset in a .csv file, but doing multencode and exporting as a .csv I still have a file with the string instead of the integer.
    Thank you in advance,
    Marco
    Last edited by Marco Ven; 25 May 2022, 03:28.

  • #2
    18 million distinct values cannot be stored in an int, for that you need a long.

    Here is an attempt. With two reshapes it is probably slow in large datasets, which seem to be what you have.

    Code:
    // create some example data
    clear
    input str2 (addr_orig addr_dest)
    "a" "g"
    "b" "h"
    "c" "i"
    "d" "a"
    "b" "i"
    end
    
    // create the new variables
    gen id = _n
    reshape long addr_, i(id) j(what) string
    sort addr_
    by addr_ : gen long code = _n == 1
    gen byte touse = code
    replace code = sum(code)
    
    mata:
    st_sview(lab="",.,"addr_", "touse")
    st_view(val=.,.,"code","touse")
    st_vlmodify("code_lb",val,lab)
    end
    
    label values code code_lb
    drop touse
    reshape wide addr_ code, i(id) j(what) string
    
    list
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      Originally posted by Maarten Buis View Post
      18 million distinct values cannot be stored in an int, for that you need a long.

      Here is an attempt. With two reshapes it is probably slow in large datasets, which seem to be what you have.

      Code:
      // create some example data
      clear
      input str2 (addr_orig addr_dest)
      "a" "g"
      "b" "h"
      "c" "i"
      "d" "a"
      "b" "i"
      end
      
      // create the new variables
      gen id = _n
      reshape long addr_, i(id) j(what) string
      sort addr_
      by addr_ : gen long code = _n == 1
      gen byte touse = code
      replace code = sum(code)
      
      mata:
      st_sview(lab="",.,"addr_", "touse")
      st_view(val=.,.,"code","touse")
      st_vlmodify("code_lb",val,lab)
      end
      
      label values code code_lb
      drop touse
      reshape wide addr_ code, i(id) j(what) string
      
      list

      Can you please tell me why I performed your example and worked well, while with my dataset I obtain this error?
      Of course the variable id does not uniquely idenfity the observations, but how can I manage it?

      reshape long addr_, i(id) j(what) string
      (j = dest orig)
      variable id does not uniquely identify the observations
      Your data are currently wide. You are performing a reshape long. You specified i(id) and j(what). In the current wide form,
      variable id should uniquely identify the observations. Remember this picture:

      long wide
      +---------------+ +------------------+
      | i j a b | | i a1 a2 b1 b2 |
      |---------------| <--- reshape ---> |------------------|
      | 1 1 1 2 | | 1 1 3 2 4 |
      | 1 2 3 4 | | 2 5 7 6 8 |
      | 2 1 5 6 | +------------------+
      | 2 2 7 8 |
      +---------------+
      Type reshape error for a list of the problem observations.
      Last edited by Marco Ven; 25 May 2022, 09:36.

      Comment


      • #4
        Did you create the variable id using the code I gave you?
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------

        Comment


        • #5
          Yes, I used the code you gave me.
          Sorry for the late answer.
          I solved the problem using Python instead of Stata.
          Thank you once again,
          Marco

          Comment

          Working...
          X