How to split lines into separate ones?

Sang-Min Park

Join Date: Mar 2018

Posts: 11
#1

How to split lines into separate ones?

06 Feb 2021, 11:51

Dear all,

My data has about 200,000 observations and it has two distinct ids (Rank and NCTId), Date, Location Country, Location Zip, and other variables. I have interest in counting "United States" zip codes by year.

I think the cases with only one US zip code is no problem. But some observations have multiple locations as seen below. There are quite various cases: US, US|US... (divided by '|'), US|US|other country (or multiple countries; in some cases, more than 10 countries are combined). In those cases, I would like to separate or split the line into additional lines with separate zip codes.

For example, the first line in the below data would be divided into two separate lines with different (Location country and) Location Zip while sharing other variables such as Rank, NCTId, OrgStudyId, Date, and etc. Another example, I would like to make the fourth line into separate two lines of US and Canada zip codes. Then, I could count the US zip codes.

Thank you.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str6 Rank str11 NCTId str32 OrgStudyId str18 StudyFirstPostDate strL(LocationCountry2 LocationZip) "364383" "NCT00002293" "026B" "August 31, 2001" "United States|United States" "35233|085434000" "199477" "NCT02142465" "LAZ-012-0002SA" "May 20, 2014" "" "" "206224" "NCT02054260" "SCHBC_IRB_2013-40" "February 4, 2014" "" "420-767" "342857" "NCT00253617" "AXCAN-PHOCCC04-01" "November 15, 2005" "United States|Canada" "90095-1781|J3H 6C4" "255087" "NCT01414270" "1108-V19_PT03-01" "August 11, 2011" "" "" "262783" "NCT01313195" "C-10-028" "March 11, 2011" "" "" "208166" "NCT02028897" "SAIRB-13-0040" "January 7, 2014" "United States" "89117" end
Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

06 Feb 2021, 14:53

I think this sample code will do what you want. The variable seq is created in order to sort the results into the same order as the input to make it easier to compare.

Code:

generate seq = _n
split LocationCountry2, parse(|)
split LocationZip, parse(|)
describe, fullnames
drop LocationCountry2 LocationZip
reshape long LocationCountry2 LocationZip, i(Rank NCTId) j(c2num)
sort seq
drop seq
drop if missing(LocationCountry2,LocationZip) & c2num>1
list Rank NCTId c2num LocationCountry2 LocationZip, sepby(Rank NCTId) abbreviate(20)

Code:

. generate seq = _n

. split LocationCountry2, parse(|)
variables created as string: 
LocationC~21  LocationC~22

. split LocationZip, parse(|)
variables created as string: 
LocationZip1  LocationZip2

. describe, fullnames

Contains data
  obs:             7                          
 vars:            11                          
------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
Rank            str6    %9s                   
NCTId           str11   %11s                  
OrgStudyId      str32   %32s                  
StudyFirstPostDate
                str18   %18s                  
LocationCountry2
                strL    %9s                   
LocationZip     strL    %9s                   
seq             float   %9.0g                 
LocationCountry21
                str13   %13s                  
LocationCountry22
                str13   %13s                  
LocationZip1    str10   %10s                  
LocationZip2    str9    %9s                   
------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. drop LocationCountry2 LocationZip

. reshape long LocationCountry2 LocationZip, i(Rank NCTId) j(c2num)
(note: j = 1 2)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        7   ->      14
Number of variables                   9   ->       8
j variable (2 values)                     ->   c2num
xij variables:
    LocationCountry21 LocationCountry22   ->   LocationCountry2
              LocationZip1 LocationZip2   ->   LocationZip
-----------------------------------------------------------------------------

. sort seq

. drop seq

. drop if missing(LocationCountry2,LocationZip) & c2num>1
(5 observations deleted)

. list Rank NCTId c2num LocationCountry2 LocationZip, sepby(Rank NCTId) abbreviate(20)

     +---------------------------------------------------------------+
     |   Rank         NCTId   c2num   LocationCountry2   LocationZip |
     |---------------------------------------------------------------|
  1. | 364383   NCT00002293       1      United States         35233 |
  2. | 364383   NCT00002293       2      United States     085434000 |
     |---------------------------------------------------------------|
  3. | 199477   NCT02142465       1                                  |
     |---------------------------------------------------------------|
  4. | 206224   NCT02054260       1                          420-767 |
     |---------------------------------------------------------------|
  5. | 342857   NCT00253617       1      United States    90095-1781 |
  6. | 342857   NCT00253617       2             Canada       J3H 6C4 |
     |---------------------------------------------------------------|
  7. | 255087   NCT01414270       1                                  |
     |---------------------------------------------------------------|
  8. | 262783   NCT01313195       1                                  |
     |---------------------------------------------------------------|
  9. | 208166   NCT02028897       1      United States         89117 |
     +---------------------------------------------------------------+

Comment

Sang-Min Park

Join Date: Mar 2018

Posts: 11
#3

06 Feb 2021, 20:54

Dear William,
Thank you so much. I just applied your code to a small set of the data and found that it works.
Sang-Min
Comment

Announcement

How to split lines into separate ones?

Comment

Comment