Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to split lines into separate ones?

    Dear all,

    My data has about 200,000 observations and it has two distinct ids (Rank and NCTId), Date, Location Country, Location Zip, and other variables. I have interest in counting "United States" zip codes by year.

    I think the cases with only one US zip code is no problem. But some observations have multiple locations as seen below. There are quite various cases: US, US|US... (divided by '|'), US|US|other country (or multiple countries; in some cases, more than 10 countries are combined). In those cases, I would like to separate or split the line into additional lines with separate zip codes.

    For example, the first line in the below data would be divided into two separate lines with different (Location country and) Location Zip while sharing other variables such as Rank, NCTId, OrgStudyId, Date, and etc. Another example, I would like to make the fourth line into separate two lines of US and Canada zip codes. Then, I could count the US zip codes.

    Thank you.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 Rank str11 NCTId str32 OrgStudyId str18 StudyFirstPostDate strL(LocationCountry2 LocationZip)
    "364383" "NCT00002293" "026B"              "August 31, 2001"   "United States|United States" "35233|085434000"   
    "199477" "NCT02142465" "LAZ-012-0002SA"    "May 20, 2014"      ""                            ""                  
    "206224" "NCT02054260" "SCHBC_IRB_2013-40" "February 4, 2014"  ""                            "420-767"           
    "342857" "NCT00253617" "AXCAN-PHOCCC04-01" "November 15, 2005" "United States|Canada"        "90095-1781|J3H 6C4"
    "255087" "NCT01414270" "1108-V19_PT03-01"  "August 11, 2011"   ""                            ""                  
    "262783" "NCT01313195" "C-10-028"          "March 11, 2011"    ""                            ""                  
    "208166" "NCT02028897" "SAIRB-13-0040"     "January 7, 2014"   "United States"               "89117"             
    end

  • #2
    I think this sample code will do what you want. The variable seq is created in order to sort the results into the same order as the input to make it easier to compare.
    Code:
    generate seq = _n
    split LocationCountry2, parse(|)
    split LocationZip, parse(|)
    describe, fullnames
    drop LocationCountry2 LocationZip
    reshape long LocationCountry2 LocationZip, i(Rank NCTId) j(c2num)
    sort seq
    drop seq
    drop if missing(LocationCountry2,LocationZip) & c2num>1
    list Rank NCTId c2num LocationCountry2 LocationZip, sepby(Rank NCTId) abbreviate(20)
    Code:
    . generate seq = _n
    
    . split LocationCountry2, parse(|)
    variables created as string: 
    LocationC~21  LocationC~22
    
    . split LocationZip, parse(|)
    variables created as string: 
    LocationZip1  LocationZip2
    
    . describe, fullnames
    
    Contains data
      obs:             7                          
     vars:            11                          
    ------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ------------------------------------------------------------------------------------------------
    Rank            str6    %9s                   
    NCTId           str11   %11s                  
    OrgStudyId      str32   %32s                  
    StudyFirstPostDate
                    str18   %18s                  
    LocationCountry2
                    strL    %9s                   
    LocationZip     strL    %9s                   
    seq             float   %9.0g                 
    LocationCountry21
                    str13   %13s                  
    LocationCountry22
                    str13   %13s                  
    LocationZip1    str10   %10s                  
    LocationZip2    str9    %9s                   
    ------------------------------------------------------------------------------------------------
    Sorted by: 
         Note: Dataset has changed since last saved.
    
    . drop LocationCountry2 LocationZip
    
    . reshape long LocationCountry2 LocationZip, i(Rank NCTId) j(c2num)
    (note: j = 1 2)
    
    Data                               wide   ->   long
    -----------------------------------------------------------------------------
    Number of obs.                        7   ->      14
    Number of variables                   9   ->       8
    j variable (2 values)                     ->   c2num
    xij variables:
        LocationCountry21 LocationCountry22   ->   LocationCountry2
                  LocationZip1 LocationZip2   ->   LocationZip
    -----------------------------------------------------------------------------
    
    . sort seq
    
    . drop seq
    
    . drop if missing(LocationCountry2,LocationZip) & c2num>1
    (5 observations deleted)
    
    . list Rank NCTId c2num LocationCountry2 LocationZip, sepby(Rank NCTId) abbreviate(20)
    
         +---------------------------------------------------------------+
         |   Rank         NCTId   c2num   LocationCountry2   LocationZip |
         |---------------------------------------------------------------|
      1. | 364383   NCT00002293       1      United States         35233 |
      2. | 364383   NCT00002293       2      United States     085434000 |
         |---------------------------------------------------------------|
      3. | 199477   NCT02142465       1                                  |
         |---------------------------------------------------------------|
      4. | 206224   NCT02054260       1                          420-767 |
         |---------------------------------------------------------------|
      5. | 342857   NCT00253617       1      United States    90095-1781 |
      6. | 342857   NCT00253617       2             Canada       J3H 6C4 |
         |---------------------------------------------------------------|
      7. | 255087   NCT01414270       1                                  |
         |---------------------------------------------------------------|
      8. | 262783   NCT01313195       1                                  |
         |---------------------------------------------------------------|
      9. | 208166   NCT02028897       1      United States         89117 |
         +---------------------------------------------------------------+

    Comment


    • #3
      Dear William,
      Thank you so much. I just applied your code to a small set of the data and found that it works.
      Sang-Min

      Comment

      Working...
      X