Dropping repeat characters of a long string in Stata (left-truncation)

Mina Alma

Join Date: Nov 2023

Posts: 1
#1

Dropping repeat characters of a long string in Stata (left-truncation)

17 Nov 2023, 06:40

I have a dataset of what used to be ~8000 rows of a long 80 character string variable. I split up the string so that every row is now one character of the string where caseid identifies the individual (which used to be one row when the string was still together) and method is the name of the variable with the strings split up. Here is an example of what my string looked like before I separated it: ‘1110000333333333333333333333333NNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNN’ and now this string is split between rows 1-67 with CASE ID 1 10 1, for example. I need to drop all the characters that are at the end of the string that repeat for each case ID. In this example, I would drop rows 36-67 that repeat the N for Case ID 1 10 1. The characters that can be dropped are 1, 2, 3, 5, 6, 7, 8, 9, C, E, L, M, N, S, W. However, the number of characters (i.e., number of rows) I would drop for each individual varies as one person may only have 3 characters that repeat, for example, and another person may have 20, and someone else the whole string is the same character and therefore I would drop the individual/ all the rows. Essentially, I am trying to left-truncate my data. I would like to create a new variable of these dropped characters called dropped_characters to see which rows /characters were dropped. I tried doing this when the string as still together (i.e., before I separated it out into different rows) and it didn’t work. So now I am trying to do it with the string already split up. I have tried various versions of the code below but I keep getting a syntax error and it properly create dropped_characters

* Create a new variable to identify the characters to be dropped

gen dropped_characters = ""

* Identify groups based on caseid

egen group_id = group(caseid)

* Loop through each group

levelsof group_id, local(groups)

foreach group in `groups' {

* Loop through each character to identify and drop the unnecessary ones

forval i = 1/`max_rows' {

local char : substr(method, `i', 1)

if inlist("`char'", "1", "2", "3", "5", "6", "7", "8", "9", "C", "E", "L", "M", "N", "S", "W") {

* Append to the variable indicating dropped characters

replace dropped_characters = dropped_characters + "`char'" if group_id == `group'

* Set the character to missing

replace method = subinstr(method, "`char'", "", `i') if group_id == `group'

}

}

}

* Drop empty rows

drop if missing(method)

* Display the dropped characters for each case ID

tabulate caseid dropped_characters, gen(dropped_char_summary)

* Save the cleaned dataset

save cleaned_data.dta, replace
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10195
#2

17 Nov 2023, 08:07

You attempted a solution to your problem, got stuck and now you are asking for help on your attempted solution. See the XY problem. We'd much rather have a go at solving your actual problem. You have shown an example of one of the string observations. Perhaps show us one or two more and illustrate what you want the observations to look like after applying the string manipulations.
Comment

Announcement

Dropping repeat characters of a long string in Stata (left-truncation)

Comment