Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Converting dataset into a standard text file with 100 positions per case

    Hello,

    I have a dataset (with 18 string and numeric variables) which I need to turn into a standard text file, with 100 positions per record (row), with no tabs, commas, delimiters, or other punctuation. I wanted to know if I can use Stata to turn my dataset into a text file and to position each data element in exact record positions (described below):

    (1-20) LAST NAME
    (21-35) FIRST NAME
    (36) MIDDLE INITIAL
    (37-45) SOCIAL SECURITY NUMBER
    (46-47) MONTH OF BIRTH
    (48-49) DAY OF BIRTH
    (50-53) YEAR OF BIRTH
    (54-71) FATHER’S SURNAME
    (72) AGE UNIT (at death)
    (73-74) NUMBER OF AGE UNITS (at death)
    (75) SEX
    (76) RACE
    (77) MARITAL STATUS
    (78-79) STATE OF RESIDENCE
    (80-81) STATE OF BIRTH
    (82-91) CONTROL/ID NUMBER
    (92-97) OPTIONAL USER DATA
    (98-100) BLANK FIELD


    Here are two examples of what a record with 100 positions following the position of each variable would look like:

    WASHINGTON GEORGE T28516085604081731WASHINGTON 075121013059A536a58 060183

    ROOSEVELT ELEANOR 26270217111231926ARIAS 029211NVCA62345 122296


    Thank you!!


  • #2
    Seems like you are interested in string concatenation, although I do not know exactly what your reference to "positions" implies beyond the desired order. Extra spaces in case some field is missing or is not of maximum length? See the -concat()- function of egen.

    Code:
    help egen
    Last edited by Andrew Musau; 08 Aug 2022, 11:19.

    Comment


    • #3
      I think it is string concatenation AND formatting. For example, DAY OF BIRTH must occupy positions (character offsets from beginning) 48-49 in the resulting string. Hence if the day is 1, it is to be included as 01 to fill the two allocated positions and maintain the original numeric value.

      What to do with missings (as Andrew already asked) or with too long string content is entirely dependent on the context, which I fail to see here, but is, perhaps, obvious to Daniela Kaiser .

      It looks to me like an attempt to generate data input to some legacy system (from the punch cards era). In which case I believe the following conventions are applicable:
      - string content that is too long is truncated at capacity;
      - string content that is too short is padded at the right with spaces (spaces AFTER the content)
      - numeric content that is too short is padded at the left with zeroes (zeroes BEFORE the content, aka leading zeroes)
      - floating numbers are rounded to number of digits permitted for storage (e.g. 2 or other number of decimals) and padded with zeroes from both sides: e.g. "12.3989" will become "0012.40" if there are two max digits for decimals and field total width is 7 positions.
      - etc

      Concatenate the resulting formatted fields.
      It makes sense to generate the 18 formatted string vars separately before concatenation to inspect and assert the formatting was done properly.

      Best, Sergiy

      Comment


      • #4
        Thank you both so much, concatenation definitely makes sense for this!

        What I'm not sure about now is, before concatenating, how to impose the character limitations on each variable. For instance, I want to force the variables last name to be 20 characters long, so if a last name is 8 letters long I need it to have 12 spaces after it. Do you happen to know how I can do this?

        Thanks!

        Comment


        • #5
          Code:
          clear
          input str29(firstname middlename lastname SSN)
          "Florence" "Griffith" "Joyner" "566-02-XXXX"
          "Usain" "" "Bolt" "589-64-XXXX"
          end
          
          *(1-20) LAST NAME
          *(21-35) FIRST NAME
          *(36-36) MIDDLE INITIAL
          *(37-45) SOCIAL SECURITY NUMBER
          
          gen wanted= substr(lastname +"                    ", 1, `=20-1+1')+ ///
                          substr(firstname +"                    ", 1, `=35-21+1')+ ///
                              substr(middlename +"                    ", 1, `=36-36+1')+ ///
                                  substr(SSN +"                    ", 1, `=45-37+1')
          Res.:

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str45 wanted
          "Joyner              Florence       G566-02-XX"
          "Bolt                Usain           589-64-XX"
          end
          Last edited by Andrew Musau; 08 Aug 2022, 17:24.

          Comment

          Working...
          X