Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing string values with numerical values

    Hi everyone,

    I'm working on a project using multiple tennis data sets (initially csv files, using a loop to run them into Stata) and two of the columns in my original data, LRank and LPts (referring to the rank of the player who loses the match and LPts referring to the number of points the losing player has coming into the match) contain numbers, except for a few rows where the value is "N/A." When I run my loop function, all the datasets are displayed in Stata, however, once Stata reaches a dataset that contains "N/A" in one of these columns, it doesn't input any of the values for LRank or LPts. I've tried the following code below but when I do this, I normally get a type mismatch error:

    gen LRank_str_num = string(LRank)
    gen LPts_str_num = string(LPts)

    replace LRank_str_num = 99999 if LRank_str_num == "N/A"
    destring LRank_str_num, replace
    replace LPts_str_num = 99999 if LPts_str_num == "N/A"
    destring LPts_str_num, replace

    I've included the rest of my loop function here for reference but nothing seems to be wrong with that. The problem is the LRank and LPts columns when they contain "N/A" as a value. All I need to do I believe is add a line of code in my loop function that will change those values to a number (ie replace LPts = 99999 if LPts == "N/A") or convert the entire column into string data, then change the "N/A" values to a number, then convert the column back to integer values.

    cd "/Users/rtwb11/Dropbox/23-ECON3720-Project/Original Data"
    local files : dir "`filepath'" files "*.csv"
    di `"`files'"'
    tempfile master
    save `master', replace empty
    foreach x of local files {
    di "`x'"

    qui: import delimited "`x'", delimiter(",") case(preserve) clear
    qui: gen id = subinstr("`x'", ".csv", "", .)

    append using `master', force
    save `master', replace
    }

    save projectdata

    Again, I am new to using Stata and appreciate any help. If additional code or data would be helpful, please let me know. I would include the -dataex- values however each tournament is >100 rows so the -dataex- doesn't display the problem I am trying to describe because the first dataset doesn't contain "N/A" in any of its rows. Thank you!

  • #2
    Since you're trying to replace a string variable, you should probably be typing "9999" otherwise you are trying to input a number into a string variable. You want to replace it with a string that just happens to be the number nine repeated many times. Then you can destring the variable. Of course if you just want to get rid of the "N/A" you can just use the force option of destring to convert them to missing data.

    Comment


    • #3
      The lack of a data example is frustrating here but it sounds as if LRank and LPts are already string, and if so trying to make further string variables out of them won't work. Other way tound if they are numeric then someone -- you or someone else -- pushed them through encode. which has to be reversed.

      Replacing NA by 9999 is a really bad idea.

      What you can do is just show us the results of

      Code:
      describe Lrank LPts 
      
      tab1 Lrank LPts

      Comment


      • #4
        Here is the describe LRank LPts output and the tab1 LRrank and LPts. I condensed the tab1 LRank and LPts because what it's giving me is the output it seems from a dataset that doesn't include any N/A rows. It has all frequencies as 1 and percent as 0.79.
        The describe says LRank and LPts are stored as ints which is why I was trying to convert them to str. I've also linked where I retrieved the data from (http://tennis-data.co.uk/ausopen.php). If you click on ATP Men's Tour 2023, you can see in Row 35 that LRank and LPts are N/A, for example.

        I also noticed when the loop was running, these highlighted lines were different than the other variable lines, and don't know if that has something to do with the problem. Sorry for the issues with my first question, and I hope this helps to resolve them.
        Click image for larger version

Name:	Screen Shot 2023-04-17 at 11.26.05 PM.png
Views:	2
Size:	290.3 KB
ID:	1710209


        Variable Storage Display Value
        name type format label Variable label
        ----------------------------------------------------------------------------------------------------------------------------------------------------------------
        LRank int %8.0g
        LPts int %8.0g


        LRank | Freq. Percent Cum.
        ------------+-----------------------------------
        1 | 1 0.79 0.79
        2 | 1 0.79 1.57
        3 | 1 0.79 2.36

        ...
        315 | 1 0.79 98.43
        535 | 1 0.79 99.21
        600 | 1 0.79 100.00
        ------------+-----------------------------------
        Total | 127 100.00


        LPts | Freq. Percent Cum.
        ------------+-----------------------------------
        50 | 1 0.79 0.79
        66 | 1 0.79 1.57
        170 | 1 0.79 2.36
        ...
        5561 | 1 0.79 98.43
        5940 | 1 0.79 99.21
        8017 | 1 0.79 100.00
        ------------+-----------------------------------
        Total | 127 100.00

        Attached Files

        Comment


        • #5
          Thanks for the extra details. You have shown results for a dataset with no N/A values and on your evidence all looks good. But if so there is no point is generating string versions of well behaved numeric variables.

          That wasn't my question, nor yours in #1 The question is about a dataset that did include N/A values. Your image shows the record of some loop over datasets, but doesn;t show or explain the underlying code.

          So, in terms of advice I am still where I was at #3. The solution could be as simple as

          Code:
          destring LRank LPts, ignore("N/A") replace
          or it could be that you need more drastic surgery but I can't be certain which is the answer.

          Comment


          • #6
            Thank you for your help, Nick! The code you described seemed to fix the problem I was having. Just as a follow-up question- what does the "replace" part of the code do? It doesn't actually replace anything as there are still a few rows where LRank or LPts are "." but it is no longer every row like I was having trouble with earlier.

            Comment

            Working...
            X