String Variables Not Visible?

Rachel Simon

Join Date: Jun 2021

Posts: 9
#1

String Variables Not Visible?

30 Jun 2021, 09:49

Hi,

I am relatively new to using STATA so I apologize if this is something obvious.

I have a variable that has values like "ypN0" "cM0" "4A" - so I coded this as a string variable.

When I run the script, the columns for these variables come up entirely blank.

Just to make sure there was actual data in those columns, I switched out "str" for "long" - and got a very long list of errors saying, for example, "cM0 cannot be read as a number".
So I know the values are there, they're just not showing up.

What other data type could it be? It's not numeric, and string isn't working out. I did try "str5" as well since the max amount of characters for this data is 5, but still blank.

Thanks!
Tags: None
Jeph Herrin

Join Date: Apr 2014

Posts: 335
#2

30 Jun 2021, 10:04

What "script"? Without knowing what you are doing, it is difficult to provide any insights.
Comment
Rachel Simon

Join Date: Jun 2021

Posts: 9
#3

30 Jun 2021, 10:16

Originally posted by Jeph Herrin View Post

What "script"? Without knowing what you are doing, it is difficult to provide any insights.

Sorry, it's a script for pulling coded information from a database, so I have to create/identify all the variables their types, label them, etc. So I have a long list of variables, but there's about 4 variables (all the same type, they describe cancer staging) that should be strings that just aren't appearing at all.

Example of a part of the list:

str CASE_ID 1- 37 ///
str FACILITY_ID 38- 47 ///
byte FACILITY_TYPE_CD 48- 48 ///
byte FACILITY_LOCATION_CD 49- 49 ///
int AGE 50- 52 ///
byte SEX 53- 53 ///

And so on.

But these cancer staging ones, ex: "str CANCER_STAGE #- # ///" just are not populating into STATA. I know the values are there because when I try to use float/int/long/byte, I get the error that it cannot be read as a number.

I hope this is more helpful!
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10219
#4

30 Jun 2021, 11:26

It may just be the display format. Have you tried listing them?, e.g.,

Code:

list CANCER_STAGE* in 1/10
Comment
Rachel Simon

Join Date: Jun 2021

Posts: 9
#5

30 Jun 2021, 11:34

Originally posted by Andrew Musau View Post

It may just be the display format. Have you tried listing them?, e.g.,

Code:

list CANCER_STAGE* in 1/10

I tried listing as well, but the results are still blank. I also tried some other simple commands like tabulating to see what would come up, but it just returns with "no observations".

I do know that the data is there because I can see every entry when I try to run it as a byte (or other numeric variable type) - it will go through each entry and return with "cannot be read as a number" for each one.

But when I have it listed as a string variable, no errors come up, the columns are just empty.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10219
#6

30 Jun 2021, 11:40

I also tried some other simple commands like tabulating to see what would come up, but it just returns with "no observations"

This suggests that the variable names were imported, but not their entries. You have to go back to the import and figure out what went wrong.

I do know that the data is there because I can see every entry when I try to run it as a byte (or other numeric variable type) - it will go through each entry and return with "cannot be read as a number" for each one.

Not sure what you mean here. Is this when importing the variables? Can you show the exact commands that you ran?

Last edited by Andrew Musau; 30 Jun 2021, 11:46.
Comment
Rachel Simon

Join Date: Jun 2021

Posts: 9
#7

30 Jun 2021, 11:58

Originally posted by Andrew Musau View Post

This suggests that the variable names were imported, but not their entries. You have to go back to the import and figure out what went wrong.

Not sure what you mean here. Is this when importing the variables? Can you show the exact commands that you ran?

Yes sorry, when importing the variables.
All I theoretically should have to do to import the data for this variable is:

infix str CANCER_STAGE 1- 37 ///
using "C:\Users\me\datafile.DAT"

label var CANCER_STAGE "Cancer Stage"

When that didn't work out, I went back and edited it as:

byte CANCER_STAGE 1- 37 ///

Just to see what errors/entries would appear as a byte instead of a string.

When I run the script, I get at least a few thousand lines of:
'ypT1' cannot be read as a number for CANCER_STAGE[333294]
'ypT3' cannot be read as a number for CANCER_STAGE[333440]
'ypT1' cannot be read as a number for CANCER_STAGE[333452]
'ypT3' cannot be read as a number for CANCER_STAGE[333456]
'ypT0' cannot be read as a number for CANCER_STAGE[333457]
'ypT3' cannot be read as a number for CANCER_STAGE[333458]
for every entry.

I want to be able to see those "ypT#" values, that's exactly what I'm looking for - but they're just not there as a string, I only know they exist in the data set because of this error message.

I hope that makes sense. Since I'm new, my lingo isn't great.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10219
#8

30 Jun 2021, 12:41

That is strange. Is it a regular text file? If so, what do you get running the following?

Code:

import delimited using "C:\Users\me\datafile.DAT", clear rowrange(333294:333458)

Last edited by Andrew Musau; 30 Jun 2021, 12:43.
Comment

Vivek Gupta

Join Date: Sep 2016
Posts: 25

30 Jun 2021, 22:10

In case you are working with data from a SQL database, you can directly use ODBC based data loading.

I had used similar approach fro loading data from MySQL on a Win10 OS laptop,

Code:

version 12.1
clear
capture log close
log using DataLoadSDI.smcl, replace

// List of DBs
local dblist " india1 india3  india4 india6 india7  india8  india9  " // NAMES OF MYSQL Databases
 
foreach d in   `dblist' {
    cd "Data"
    odbc load, ///
      exec("SELECT * FROM `d'.sdi") ///
      conn("DRIVER={MySQL ODBC 5.3x64Unicode};SERVER=localhost;PORT=3308;DATABASE=`d';UID=root;PWD=root;") clear  sqlshow // MySQL ODBC 5.3x64Unicode is the name of the mySQL ODBC driver installed in the windows 10 laptop. The driver is called by its name
      save sdi_`d'.dta, replace
     count    
}

// Create Shell / Empty DTA File
use sdi_india1.dta, clear
drop in 1/-1
save "india_sdi_merged.dta", replace
count

// Append each database
foreach d in   `dblist'    {
    use sdi_`d'.dta, clear
    count
    append using india_sdi_merged.dta
    save india_sdi_merged.dta, replace
}
compress
count

// DATA MANAGEMENT
rename FirstName FirstName_Respondent
rename MI MI_Respondent
rename LastName LastName_Respondent
rename PatientNUmber PatientNumber_Respondent

// DEFINE Value Labels in another DO file and execute it
do "labels_demo.do"

// APPLY value Labels
label value area lbl_Area
label value sex lbl_Sex
label value edu lbl_edu

// FINAL SAVE
save india_sdi_merged.dta, replace

Also if there are multiple tables that require JOINs then rather than pass a complex query to ODBC LOAD, I had created MySQL Views in the database and then called the view directly using a simple Select * from ViewName query as shown above.

You can consult "help odbc" for import options that may be helpful or more applicable.

Hopefully this is helpful
Vivek

Stata 15.1 (MP 2 core)
https://www.epidemiology.tech/category/stata/
Google Scholar Profile

Comment

Vivek Gupta

Join Date: Sep 2016

Posts: 25
#10

30 Jun 2021, 22:19

Also the issue may be leading or training spaces ?
Have you tried to trim the data
can you share output of

Code:

des CANCER_STAGE codebook CANCER_STAGE gen CANCER_STAGE_2 = ustrtrim(stritrim(CANCER_STAGE))

Stata 15.1 (MP 2 core)
https://www.epidemiology.tech/category/stata/
Google Scholar Profile
Comment
Rachel Simon

Join Date: Jun 2021

Posts: 9
#11

01 Jul 2021, 05:57

Originally posted by Vivek Gupta View Post

Also the issue may be leading or training spaces ?
Have you tried to trim the data
can you share output of

Code:

des CANCER_STAGE codebook CANCER_STAGE gen CANCER_STAGE_2 = ustrtrim(stritrim(CANCER_STAGE))

variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CANCER_STAGE str15 %15s Cancer Stage

.
. codebook CANCER_STAGE

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CANCER_STAGE Cancer Stage
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

type: string (str15), but longest is str0

unique values: 0 missing "": 7,514/7,514

tabulation: Freq. Value
7,514 ""

.
. gen CANCER_STAGE_2 = ustrtrim(stritrim(CANCER_STAGE))
(7,514 missing values generated)

--------------

The values generated were still blank -
So it seems like the issue is in "but longest is str0"? Is there a workaround for these seeming to be "str0"?
I'm not sure exactly what that means or how to fix it.
Comment
Rachel Simon

Join Date: Jun 2021

Posts: 9
#12

01 Jul 2021, 06:36

Just to update, I may be out of luck if it's genuinely a "str0" variable, which is what it seems like. I can't find any info on a workaround or way to recode/edit the values or variable?
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#13

01 Jul 2021, 07:03

if there isn't, there should be so I would write to tech-support; please report their response back here if you do write
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#14

01 Jul 2021, 09:38

I think Stata is having a problem reading your dataset; perhaps it is not pure text. The Stata hexdump command might help us figure out the problem. I'd like to see the output of the following two command when applied to your datafile.DAT; it will show us exactly what the first 100 bytes are.

Code:

. ls dumpexample.txt

-rw-r--r--@ 1 lisowskiw  staff  41 Jul  1 11:34 dumpexample.txt

. hexdump dumpexample.txt, to(100)
                 |                                         |    character
                 |           hex representation            |  representation
         address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
-----------------+-----------------------------------------+-----------------
               0 | 3132 3334 3536 3738 395f 5468 6973 2069 | 123456789_This i 
              10 | 7320 615f 7465 7374 3536 3738 395f 3132 | s a_test56789_12 
              20 | 3334 3536 3437 3839 5f                  | 34564789_        

.

Comment

Rachel Simon

Join Date: Jun 2021

Posts: 9
#15

01 Jul 2021, 11:33

Update! It is "fixed".
Thank you for all the help. All of the commands helped me pinpoint the problem and were very useful to learn as I get used to utilizing Stata.

I'm actually unsure at which point the problem was solved, because originally I was unable to see any values for the variable, but I also made a lot of edits in one go.

After that, "a few" values (or so I thought) populated after making various edits, so I was convinced I was still doing something wrong that could somehow be impeding all of the data getting imported.
Apparently it was a new variable introduced into the database, so even though I'm working with over 340,000 rows of data, there were only 8,000 values for this variable so far. A lot is just genuinely missing.
Comment

Announcement