
  • How to Modify Stata Data (Dta file) using python code and save changes back to the dta file

    I have 2000 observations and 100 variables in my data from an empirical research project. I am currently doing data cleaning, which involves recoding some variables for hundreds of my observations. At the moment I am using the method below to do the cleaning:

    foreach num in 1234 2345 3456 4567 5678 6789 ///
        7891 8911 9110 1102 1103 1104 {
        replace q786 = 2 if obsid == `num'
    }

    As you can imagine, I would have to type the ID of every single observation into the foreach loop just to apply the same modification, for hundreds of observations across dozens of variables.
    I was trying to put all the IDs in a text file so that I could import the text file using Python code and go line by line doing the same modification, which would save me time and make my code concise. However, I am not able to find information on how I could import the entire dta file into my Python code, modify variables for the observations of interest, and save the modified dta file back so that I can continue in Stata. Can someone help me with this?
    I am thinking about this approach:

    python:
    from sfi import Data
    ## pseudocode
    ## import the text file that contains the IDs
    ## read the IDs line by line
    ## replace the variable of interest with a new value
    ## save the dta file
    ## exit python
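One way the pseudocode above might be fleshed out, as a minimal sketch rather than a definitive implementation. It assumes the IDs sit one per line in a plain-text file (the name `ids.txt` is hypothetical), that the dataset is already loaded in Stata, and that the code runs inside a `python:` ... `end` block, since the `sfi` module can only be imported from within Stata. The helper names `read_ids`, `matching_rows`, and `recode_in_stata` are illustrative and not part of `sfi`.

```python
from pathlib import Path


def read_ids(path):
    """Return the set of integer IDs listed one per line in a text file."""
    ids = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:  # skip blank lines
            ids.add(int(line))
    return ids


def matching_rows(obsids, ids):
    """Return the zero-based row indices whose ID value is in `ids`."""
    return [i for i, value in enumerate(obsids) if int(value) in ids]


def recode_in_stata(id_file, idvar="obsid", target="q786", new_value=2):
    """Set `target` to `new_value` for every row whose `idvar` is in the file.

    Assumption: called from inside Stata with the dataset already in memory.
    """
    from sfi import Data  # only available inside Stata's Python integration

    ids = read_ids(id_file)
    rows = matching_rows(Data.get(idvar), ids)  # all values of the ID variable
    for i in rows:
        Data.storeAt(target, i, new_value)  # observation index is zero-based
    return len(rows)  # number of observations changed
```

Inside Stata one would call `recode_in_stata("ids.txt")` between `python:` and `end`, then run `save, replace` from Stata to write the changes back to the dta file. Matching on `obsid` rather than on row position keeps the result independent of the sort order of the data.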
    Last edited by Fabian Mkocheko; 24 Oct 2021, 14:09. Reason: python

  • #2
    There is no need to go through Python to do this.

    Here's an example of the approach using auto.dta:

    Code:
    sysuse auto, clear
    
    gen byte to_use = inlist(_n, 1, 7, 33, 48, 49, 55, 59, 62, 67, 73)
    
    replace price = 7500 if to_use
    replace mpg = 40 if to_use
    replace weight = 3000 if to_use
    // etc.
    Now, -inlist()- has a limit of 250 arguments, so if more than 249 observations are to be modified, you will need to use several -inlist()- expressions, joined by the | operator.

    Or, better still: what determines which observations are to be modified? Is there some expression involving the data set variables that identifies them, or is it really just a completely arbitrary list of line numbers? If the former, you can simplify things by using that expression to define the variable to_use. If the latter, where did you get the list from? Perhaps you have, or can create, a data set, let's call it line_number_list.dta, that contains a variable, call it obs_no, that has all and only those line numbers in it. In that case you can do it like this:

    Code:
    sysuse auto, clear
    gen long obs_no = _n
    merge 1:1 obs_no using line_number_list
    gen byte to_use = _merge == 3
    drop _merge
    Finally, if you balk at typing -if to_use- at the end of a large number of commands, you can even get around that as follows:

    Code:
    sysuse auto, clear
    
    gen byte to_use = inlist(_n, 1, 7, 33, 48, 49, 55, 59, 62, 67, 73)
    
    frame put _all if to_use, into(to_be_modified)
    drop if to_use
    
    frame change to_be_modified
    replace price = 7500
    replace mpg = 40
    replace weight = 3000
    // etc.
    
    frame change default
    frameappend to_be_modified, drop
    Note: -frameappend- is written by Jeremy Freese and is available from SSC.


    • #3
      Clyde has touched upon the seemingly arbitrary list of observation numbers. I will point out the obvious: this is extremely error-prone, because correct results depend crucially on the sort order of the data. If you want to change those values via syntax scripts, then perhaps the script should include some lines that establish the required sort order.

      Clyde has also shown ways of doing what you want to do in Stata. Without going into detail, I will point to

      Code:
      help file
      which shows a way of reading (and writing) text (and even binary) files in Stata. So, even if you wanted to stick with your suggested approach, there is no need whatsoever to use Python here.
