Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • problems with format data when appending

    I am having problems in appending some datasets because some variables do not have the same format between different datasets. I tried to destring all possible variables and saved the datasets again in order to append them all afterwards. But I still had the problem once some variables could not destring because it has some observations in string in some datasets. So I did the other way around: tostring all variables in all datasets before appending them all. the problem persists: some variables are string in a dataset and float in others

    1st try:
    foreach i in AC {
    foreach a in 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 {
    foreach k in 01 02 03 04 05 06 07 08 09 10 11 12 {
    use ${AIHRED}/RD`i'`a'`k'.dta, clear
    destring*, replace
    save ${AIHRED_t}/RD`i'`a'`k'_t.dta, replace
    clear
    }
    }
    }
    foreach i in AC {
    use ${AIHRED_t}/RD`i'9201_t.dta, clear
    foreach k in 02 03 04 05 06 07 08 09 10 11 12 {
    append using ${AIHRED_t}/RD`i'92`k'_t.dta
    }
    foreach a in 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 {
    foreach k in 01 02 03 04 05 06 07 08 09 10 11 12 {
    append using ${AIHRED_t}/RD`i'`a'`k'_t.dta
    }
    }
    destring *, replace
    compress
    save ${AIHRED_t}/RD`i'.dta, replace
    clear
    }

    2nd try:
    foreach i in AC {
    foreach a in 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 {
    foreach k in 01 02 03 04 05 06 07 08 09 10 11 12 {
    use ${AIHRED}/RD`i'`a'`k'.dta, clear
    tostring*, replace
    save ${AIHRED_t}/RD`i'`a'`k'_t.dta, replace
    clear
    }
    }
    }
    foreach i in AC {
    use ${AIHRED_t}/RD`i'9201_t.dta, clear
    foreach k in 02 03 04 05 06 07 08 09 10 11 12 {
    append using ${AIHRED_t}/RD`i'92`k'_t.dta
    }
    foreach a in 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 {
    foreach k in 01 02 03 04 05 06 07 08 09 10 11 12 {
    append using ${AIHRED_t}/RD`i'`a'`k'_t.dta
    }
    }
    destring *, replace
    compress
    save ${AIHRED_t}/RD`i'.dta, replace
    clear
    }

    error displayed for the 2nd try:
    variable us_sangue is str1 in master but float in using data
    You could specify append's force option to ignore this string/numeric mismatch. The using variable would then be treated as if it
    contained "".
    r(106);


  • #2
    Your code is pretty complicated to read, so I'm not sure what is going on. My best guess is that the loops that -tostring- the data are not reaching all of the files that you want to append later. I think a better way to do this is to combine the -tostring- and -append- steps into one. You seem to have used the different loop structures because you felt the need to start the append process with one of your already saved files. But if you start with an empty file, then you can do the -append-s right after you do the -tostring-s, guaranteeing that every file you -append- has undergone the -destring-. So something like this:

    Code:
    clear
    tempfile building
    save `building', emptyok
    foreach i in AC {
     foreach a in 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 {
      foreach k in 01 02 03 04 05 06 07 08 09 10 11 12 {
       use ${AIHRED}/RD`i'`a'`k'.dta, clear
       tostring *, replace
       append using `building'
       save `"`building'"', replace
      }
     }
    }
    use `building', clear
    compress
    save  ${AIHRED_t}/RD`i'.dta, replace
    Notes:
    1. The foreach i in AC loop seems pointless. Since i only takes on a single value, "AC", you can omit this loop and just replace `i' by AC in the -use- command.
    2. Though it isn't likely the source of any problem for you, the use of global macros to store information is unsafe and should be resorted to only when there is no alternative. The problem is that you may be clobbering some other program's definition of ${AIHRED}, or some other program may be clobbering yours. When you use a local macro, those name clashes are automatically avoided, which also avoids obscure errors. As I say, it is not likely this is your current problem, but something to think about going forward.

    Hope this helps.

    Comment


    • #3
      Clyde has excellent advice as always. I'll just flag that in Stata format means display format; the problem here is one of different storage types.

      Comment


      • #4
        Thank you very much, your suggestion is much cleanner!
        But I still have the same problem: the program leaves variables as float in some datasets and as strings in others (error displayed below). How can all variables have the same storage types (thanks Nick) so I can append them all?

        variable us_sangue is float in master but str1 in using data
        You could specify append's force option to ignore this numeric/string mismatch. The using variable would then be treated as if it
        contained numeric missing value.

        Comment


        • #5
          I don't know why one of my variables could not be "tostringed" in some of my datasets.

          the log file showed me this:

          Code:
          us_sangue cannot be converted reversibly; no replace
          Do someone know a way out?

          Comment


          • #6
            "tostring, replace force" will probably do the job.

            Ideally, one would want to keep the old numeric values, using the "generate()" option.

            Examine the variable(s) that it complains about. If they're large integers, so that you're really losing information, then maybe divide them by 1000 or 1,000,000 or whatever prior to tostring. If they're just losing a little precision after the decimal place, you can probably live with it.

            Comment


            • #7
              I cannot loose those numeric values. Actually the variable shoud be a numeric one, but due to some problematic observations, I cannot destring all of them, so I tried to tostring them.

              So if with the ,force option I loose these observations, it is not an option for me. I would like to avoid destring, force as well once I would like to analyse the problematic observations. I would like to append all data without loosing information.

              The problem now is why a varible cannot be tostringed? I thought all of them could.

              Comment


              • #8
                Some numbers cannot be -tostring-ed because there would be a precision problem when you back convert from string to numeric: some information would be lost. So -tostring- won't let you do that.

                Overall, since in the end you want these to be numeric variables, it seems like it would be a good idea to go into the original data sets, one at a time perhaps, that have this as a string variable and find out why you can't -destring- them. So if your variable is x and Stata refuses to -destring- it you can explore the problem with:

                Code:
                list x if missing(real(x)) & !missing(x)
                There may be obvious things that turn up like N/A that should be turned into some sort of missing value code, or just clobbered with an empty string, or currency symbols that should be either -ignore()-d in the -destring- or removed from the variable altogether. Or you may see things that look like perfectly good numbers. In that case the problem is usually hidden "non-printing" characters. You can find those with:

                Code:
                charlist x // from SSC, by Nick Cox
                return list
                This will show you all of the characters that appear in any value of the variable x. If there are non-printing characters, you won't see those in the immediate output of -charlist-, but their ASCII codes will be found in r(ascii). Usually such non-printing characters just need to be -ignore()-d or removed from x to start with as they are rarely informative.

                Comment


                • #9
                  I like Clyde's suggestion, and it is the better approach if you have the time to sludge through all of the datasets and variables. But I don't see what's horrible or impossible with "tostring, force." It will only keep ten significant digits, but that's precision enough for most purposes.

                  For that matter, with the format() option, you can keep a lot more digits, though if your variables don't fit the same format, this may not be useful.
                  Last edited by ben earnhart; 13 Jun 2015, 16:03.

                  Comment


                  • #10
                    Ben's approach is also reasonable. But remember, if these variables really are supposed to be numbers, and you are eventually going to deal with -destring-ing them, you will have to clean up all the non-numeric stuff anyway at that point. Since that stuff seems to be getting in the way of your -append- process, and the workaround requires you to do something that may entail loss of information (although I agree that the information lost will be slight and probably of little or no practical significance), maybe it makes sense to get that part out of the way first.

                    Comment


                    • #11
                      So if you have time to do it right, follow Clyde's approach. But based on 22*12=264 datasets, with multiple variables per dataset, it might take weeks unless you can find a pattern to where destring is going awry.

                      In which case (assuming you don't find a pattern), losing a little precision may be worth it. Be careful of large numbers -- losing a few decimal points isn't a big deal, but if they are large numbers, then arbitrarily chopping off the end of the variable really does mess things up, thus be prepared to divide by a million or something like that prior to string conversion. Once you have a single dataset, then it's a lot more manageable to find and eradicate the true string values.

                      Comment

                      Working...
                      X