
  • Opening and cleaning SPSS -> STATA

    Dear Statalist,

    I hope this finds you well and rested after the weekend.
    I've recently been trying to generate a Stata dataset from SPSS files (.DAT & .SPS). The files are generated by OpenClinica, a tertiary database software package, and this is the code I run (please note that it also needs - ingap.ado - installed).

    * Generate Variable labels:
    local SpsFile : dir . files "*.sps", respect
    insheet using `SpsFile', delimiter(" ") clear

    gen Keep = 1 if v1=="VARIABLE" & v2=="LABELS"
    replace Keep=0 if v1=="VALUE" & v2=="LABELS"
    replace Keep=Keep[_n-1] if Keep==.
    keep if Keep==1
    keep if v3=="/" | v4=="/"
    drop if v1=="VARIABLE" & v2=="LABELS"

    capture noisily assert _N==0
    if _rc==0 {
        set obs 1
        gen CodeVar = "* No Variable Labels defined, or error in do file, please check"
        di as err "No Variable Labels defined, or error in do file, please check"
        pause
    }

    capture gen CodeVar=""
    replace v3=v2 if v3=="/"
    replace CodeVar= "lab var " + v1 +`" ""' + v3 + `"""'
    list v1 v2 v3 v4 CodeVar
    keep CodeVar
    ingap
    replace CodeVar=`"* Generate Variable labels from `SpsFile' "' in 1
    save VariableLabels.dta, replace

    * Generate value labels:
    insheet using `SpsFile', delimiter(" ") clear
    gen Keep = 1 if v1=="VARIABLE" & v2=="LABELS"
    replace Keep=0 if v1=="VALUE" & v2=="LABELS"
    replace Keep=Keep[_n-1] if Keep==.
    keep if Keep==0
    drop if v1=="VALUE" & v2=="LABELS"
    drop if v1=="."
    drop if v1=="EXECUTE."

    capture noisily assert _N==0
    if _rc==0 {
        set obs 1
        gen CodeVar = "* No Value Labels defined, or error in do file, please check"
        di as err "No Value Labels defined, or error in do file, please check"
        pause
    }

    capture gen CodeVar=""

    gen Var=v1 if v2=="" & v1~="/"
    replace Var=Var[_n-1] if Var==""

    replace CodeVar="label define " + v1 if v2=="" & v1~="/" & CodeVar==""
    drop if CodeVar=="label define "
    gen Quot=`"""'
    replace CodeVar= " " + v1 + " " + Quot + v2 + Quot if CodeVar==""
    replace CodeVar= "; " + "label values " + Var + " " + Var + " " + ";" if v1=="/"
    ingap
    replace CodeVar="#delimit ;" in 1

    ingap -1, after
    replace CodeVar="#delimit cr ;" in l

    list v1 v2 CodeVar, sepby(Var)

    keep CodeVar
    ingap
    replace CodeVar=`"* Generate Value Labels from `SpsFile' "' in 1
    save ValueLabels.dta, replace


    clear

    use VariableLabels.dta
    gen VarLab=1
    append using ValueLabels.dta
    gen ValLabNum=_n if strmatch(CodeVar, "*label define *")
    replace ValLabNum=ValLabNum[_n-1] if ValLabNum==.

    capture erase VariableLabels.dta
    capture erase ValueLabels.dta

    * Get rid of prefixed $ symbol in varnames
    replace CodeVar=subinstr(CodeVar, "v$", "v", .)

    * Get rid of ' quotation mark around numerical codes
    * (below needs changing if you have coded variables <-1000 or >1000)
    forvalues Num = -1000/1000 {
        replace CodeVar=subinstr(CodeVar, "'`Num''", "`Num'", .)
    }

    * Implement Dataset specific label changes below if required:

    replace CodeVar=subinstr(CodeVar, "InterviewDateE", "InterviewDate_E", .)

    format CodeVar %-20s
    * Visual inspection:
    list CodeVar if VarLab==1
    pause Please check whether variable labelling commands look ok!

    list CodeVar if VarLab==., sepby(ValLabNum)
    pause Please check whether value labelling commands look ok!


    drop VarLab ValLabNum
    outfile using "LabVarsAndValues.do", noquote replace

    clear
    local DatFile : dir . files "*.dat", respect
    insheet using `DatFile', clear case

    * Implement Dataset specific Variable name changes below if required:



    do "LabVarsAndValues.do"

    local StataFile=subinstr(`DatFile', ".dat", ".dta",1)
    save "`StataFile'", replace



    Between the step that generates the variable and value labels and the step that imports the data, I'm unsure how best to deal with non-evaluable values, e.g. an "UNKNOWN" string in a date or binary field. Would it be better to handle this while generating the variable and value labels (i.e. include a code such as 5 "UNK"), or to replace all "UNK" fields?

    Any advice or code on handling this would be welcome.

    Kind regards,
    Marcus

  • #2
    Assuming "unknown" carries no meaning of its own in this case (sometimes it does), one way to flag a given type of potentially missing data is to - encode - the variable and then use the - mvencode - command.
    Best regards,

    Marcos
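
As a rough illustration of the suggestion above, here is a minimal sketch on toy data. The variable names (visit_date_str, smoker_str), the label name (yesno) and the choice of .u as the "unknown" code are assumptions made for the example, not taken from the OpenClinica export. Note that - mvdecode - is the command that turns numeric codes into missing values; - mvencode - does the reverse.

```stata
* Toy data; all variable names here are hypothetical
clear
input str10 visit_date_str str7 smoker_str
"2014-03-01" "1"
"UNKNOWN"    "0"
"2014-04-15" "UNKNOWN"
end

* Date field: date() returns system missing (.) for "UNKNOWN";
* recode those rows to the extended missing value .u
gen long visit_date = date(visit_date_str, "YMD")
replace visit_date = .u if visit_date_str == "UNKNOWN"
format visit_date %td

* Binary field: real() returns . for non-numeric text
gen byte smoker = real(smoker_str)
replace smoker = .u if smoker_str == "UNKNOWN"
label define yesno 0 "No" 1 "Yes" .u "Unknown"
label values smoker yesno

* Alternative along the lines of the reply: encode, then map the
* numeric code that encode assigned to "UNKNOWN" onto .u
* (encode numbers the categories 1, 2, 3, not 0/1)
encode smoker_str, gen(smoker_enc)
local unk = "UNKNOWN":smoker_enc
mvdecode smoker_enc, mv(`unk' = .u)
```

Using an extended missing value keeps "unknown" distinguishable from truly empty fields, while still being excluded from estimation commands like any other missing value.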
