Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by William Lisowski View Post
    I'm sorry, but your description of what you did makes no sense to me given my understanding of your data.



    So do I understand correctly that you download a CSV and read it into Stata using import delimited after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as XLSX and read the XLSX into Stata using import Excel and find that all the rows are present?

    If this is not correct, tell me instead what it is you are doing.

    If this is correct, then do the following. If this isn't clear to you, don't do something else. Ask for clarification.

    1) Open the CSV in Excel.
    2) In Excel, add a new column to the worksheet and fill it with the line number.
    3) Save it as as a new CSV.

    The point is you now have a CSV which has line numbers, so that you can tell what observations are missing after it is read into Stata. Generating a (missing!) observation number in Stata after you have read the CSV into Stata as you described above is too late the because the missing rows are already missing!

    4) Read the new CSV with the line numbers into Stata using import delimited.
    5) In Stata, do
    Code:
    generate skip = Obs - Obs[_n-1]+1
    list Obs skip if skip>0
    where Obs is the name of the Stata variable that holds the line number from the CSV.

    Each line of the output will tell you the observation number *after* one or more dropped observations. And the value of skip will tell you how many observations were skipped. So if you get Obs=666 and skip=4 you will know that observations 662-665 were dropped.

    Look at the new CSV and find those observations. How are they different from observations that were not dropped?
    I imported two completely different csv files from two different sources when I use import delimited. I noticed many, in one case thousands, of observations dropped after importing the file. When I downloaded the csv files from the data sources, I saved them as a xlsx file. I imported these files into Stata as an Excel worksheet and none of the observations were dropped in either file. It's perplexing.

    Comment


    • #17
      Originally posted by William Lisowski View Post
      I'm sorry, but your description of what you did makes no sense to me given my understanding of your data.



      So do I understand correctly that you download a CSV and read it into Stata using import delimited after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as XLSX and read the XLSX into Stata using import Excel and find that all the rows are present?

      If this is not correct, tell me instead what it is you are doing.

      If this is correct, then do the following. If this isn't clear to you, don't do something else. Ask for clarification.

      1) Open the CSV in Excel.
      2) In Excel, add a new column to the worksheet and fill it with the line number.
      3) Save it as as a new CSV.

      The point is you now have a CSV which has line numbers, so that you can tell what observations are missing after it is read into Stata. Generating a (missing!) observation number in Stata after you have read the CSV into Stata as you described above is too late the because the missing rows are already missing!

      4) Read the new CSV with the line numbers into Stata using import delimited.
      5) In Stata, do
      Code:
      generate skip = Obs - Obs[_n-1]+1
      list Obs skip if skip>0
      where Obs is the name of the Stata variable that holds the line number from the CSV.

      Each line of the output will tell you the observation number *after* one or more dropped observations. And the value of skip will tell you how many observations were skipped. So if you get Obs=666 and skip=4 you will know that observations 662-665 were dropped.

      Look at the new CSV and find those observations. How are they different from observations that were not dropped?
      For every observation, with the exception of the first one, skip is coded as "2."

      Code:
      2020 2 20 1920 .75   1 .
      2020 2 20 1921 .75   2 2
      2020 2 20 1922 .75   3 2
      2020 2 20 1923 .75   4 2
      2020 2 20 1924 .75   5 2
      2020 2 20 1925 .75   6 2
      2020 2 20 1926 .75   7 2
      2020 2 20 1927 .75   8 2
      2020 2 20 1928 .75   9 2
      2020 2 20 1929 .75  10 2
      2020 2 20 1930 .75  11 2
      2020 2 20 1931 .75  12 2
      2020 2 20 1932 .75  13 2
      2020 2 20 1933 .75  14 2
      2020 2 20 1934 .75  15 2
      2020 2 20 1935 .75  16 2
      2020 2 20 1936 .75  17 2
      2020 2 20 1937 .75  18 2
      2020 2 20 1938 .75  19 2
      2020 2 20 1939 .75  20 2
      2020 2 20 1940 .75  21 2
      2020 2 20 1941 .75  22 2
      2020 2 20 1942 .75  23 2
      2020 2 20 1943 .75  24 2

      Comment


      • #18
        Sorry, typo, and I thought of some improvements.

        In post #16, change step 5 to
        Code:
        replace skip = Obs - (Obs[_n-1]+1)
        tab skip, missing
        list Obs skip if skip>0

        Comment


        • #19
          Originally posted by William Lisowski View Post
          Sorry, typo, and I thought of some improvements.

          In post #16, change step 5 to
          Code:
          replace skip = Obs - (Obs[_n-1]+1)
          tab skip, missing
          list Obs skip if skip>0
          Hi, William. Thank you. The following are the results:
          Code:
           tab skip, missing
          
                 skip |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 |     86,107      100.00      100.00
                 4406 |          1        0.00      100.00
                    . |          1        0.00      100.00
          ------------+-----------------------------------
                Total |     86,109      100.00
          
          . list Obs skip if skip>0
          
                 +--------------+
                 |   Obs   skip |
                 |--------------|
              1. |     1      . |
          42550. | 46956   4406 |
                 +--------------+
          Edit: I don't have the slightest clue why all those values were skipped. Nothing sticks out in the csv file.
          Last edited by Eli Aguado; 18 Jan 2020, 20:52.

          Comment


          • #20
            So apparently not every observation had skip = 2 as you reported in post #17.

            We see that 4405 observations starting at 42551 and ending at 42955 were omitted. My guess is that something is strange in your CSV file at or around observation 42551.

            In post #16 I wrote the very detailed descriptrion of what I thought you had done
            So do I understand correctly that you download a CSV and read it into Stata using import delimited after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as XLSX and read the XLSX into Stata using import Excel and find that all the rows are present?

            If this is not correct, tell me instead what it is you are doing.
            to which you replied

            When I downloaded the csv files from the data sources, I saved them as a xlsx file.
            which is not nearly so detailed. Were the data stored at the source as XLSX files or as CSV files? What does it mean to download a CSV file a the data source and save it as an XLSX file? Download is the verb that describes having a web site give you a copy of a file. Generally, my web browser is displaying the data source, I click on a link, and the data source sends the file in whatever format it has it in. If it's a CSV there's no way for the download process to change it to an XLSX file.

            Write a recipe that describes step by step how you got the CSV and XLSX files. Extra points if the data source is publicly accessible and you can give a URL to download a copy.




            Comment


            • #21
              Originally posted by William Lisowski View Post
              So apparently not every observation had skip = 2 as you reported in post #17.

              We see that 4405 observations starting at 42551 and ending at 42955 were omitted. My guess is that something is strange in your CSV file at or around observation 42551.

              In post #16 I wrote the very detailed descriptrion of what I thought you had done


              to which you replied



              which is not nearly so detailed. Were the data stored at the source as XLSX files or as CSV files? What does it mean to download a CSV file a the data source and save it as an XLSX file? Download is the verb that describes having a web site give you a copy of a file. Generally, my web browser is displaying the data source, I click on a link, and the data source sends the file in whatever format it has it in. If it's a CSV there's no way for the download process to change it to an XLSX file.

              Write a recipe that describes step by step how you got the CSV and XLSX files. Extra points if the data source is publicly accessible and you can give a URL to download a copy.



              With one data source, the data source provides the following: 1) the data saved as a csv file and 2) the data saved as an excel worksheet (.xlsx file). When I import the csv data with import delimited, observations are dropped. When I import the .xlsx file as an Excel worksheet into Stata no observations are dropped.

              With the data used in the latest posts, it is only provided by the authors as a csv file. When I download the file, and import it the csv file into stata using import delimited, I lose observations. When I download the csv file, and then save as an xlsx file, import it into Stata as an Excel worksheet, no observations are dropped. I can provide you the URL for this data file, but I would need to message you as I already had two separate cases of my research being stolen; so, needless to say, I am a bit paranoid at this stage.

              Comment


              • #22
                Eli Aguado and I have been in correspondence and I have downloaded his data from their original sources. I have a subsequent post coming but want to start with this very pointed one.

                Eli - It appears that there were a number of updates to Stata 14 across its life that addressed problems with import delimited and in particular a problem that caused "portions of the data" to be skipped. This is a fairly niche problem involving line-end characters in the input dataset, perhaps more likely encountered on a Mac.

                If your copy of Stata 14 is not fully updated, you should update it now. That may be sufficient to resolve your problems. Assuming you have administrator privileges on your system, type update query in Stata's Command window. On my system, running Stata 16, I see
                Code:
                . update query
                (contacting http://www.stata.com)
                
                Update status
                    Last check for updates:  21 Jan 2020
                    New update available:    08 Jan 2020  (what's new)
                    Current update level:    11 Dec 2019  (what's new)
                
                Possible actions
                
                    Install available updates             (or type -update all-)
                which tells me updates are available and gives me a link to click or a command to type in order to update my copy of Stata.

                Comment


                • #23
                  The first example dataset has 151,762 lines in total. Lines 1-4 are heading information, and nothing in them is suitable for use as variable names. Lines 5-151,758 contain 14 columns of comma-separate values. Columns 2, 3, and 4 contain text; the other columns apparently contain numbers. So my attempt to import the data follows.
                  Code:
                  . import delimited Example1.csv, rowrange(5) varnames(nonames)
                  (14 vars, 151,758 obs)
                  
                  . describe
                  
                  Contains data
                    obs:       151,758                          
                   vars:            14                          
                  --------------------------------------------------------------------------------
                                storage   display    value
                  variable name   type    format     label      variable label
                  --------------------------------------------------------------------------------
                  v1              int     %8.0g                
                  v2              str41   %41s                  
                  v3              str41   %41s                  
                  v4              str6    %9s                  
                  v5              str6    %9s                  
                  v6              str6    %9s                  
                  v7              str6    %9s                  
                  v8              str6    %9s                  
                  v9              str6    %9s                  
                  v10             str6    %9s                  
                  v11             str6    %9s                  
                  v12             str6    %9s                  
                  v13             str6    %9s                  
                  v14             str6    %9s                  
                  --------------------------------------------------------------------------------
                  Sorted by:
                       Note: Dataset has changed since last saved.
                  Now, the first thing I notice is that import delimited in my installation of Stata 16 did not reproduce the problem Eli reported. I hope that my advice in post #22 resolves his problem. I'll continue this exploration to its conclusion, however.

                  The next thing I notice is that Stata has decided that columns 5-15 contain non-numeric data and has imported them as string variables rather than numeric variables.

                  Here's a standard approach to figuring out what's wrong.
                  Code:
                  . foreach v of varlist v5-v14 {
                    2.     tab `v' if missing(real(`v')), missing
                    3.         }
                  
                           v5 |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                              |     56,460       95.21       95.21
                            * |      2,842        4.79      100.00
                  ------------+-----------------------------------
                        Total |     59,302      100.00
                  
                           v6 |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                              |    113,216       98.77       98.77
                            * |      1,406        1.23      100.00
                  ------------+-----------------------------------
                        Total |    114,622      100.00
                  
                  [remaining tabs omitted]
                  It turns out that 7,000+ lines in the CSV have a single asterisk in one or more of columns 5-14. Looking at the CSV in a text editor shows me where to look for examples.
                  Code:
                  . list v5-v14 in 141710/141714, clean
                  
                            v5   v6   v7   v8   v9   v10   v11   v12   v13   v14  
                  141710.    *    0    0    0    0     0     0           *     0  
                  141711.    *    *   36    0    0     0    36    36     *     *  
                  141712.   14   14    6    0    0     0     0          20    20  
                  141713.    9    7    0    0    0     0     *     *     8     7  
                  141714.    *    *    5    0    0     *     0     *     *     *
                  We see that there are both standard text missing values (e.g v12 in observation 141710) and asterisks (v6 and v13 in observation 141710). I don't know what to make of the asterisks, so I convert them to the text representation of one of Stata's special numeric missing values. I can then destring each of the variables and have a workable dataset that fully represents the input data, including the anomalous asterisks for me to deal with once I understand their meaning.
                  Code:
                  . foreach v of varlist v5-v14 {
                    2.     replace `v' = ".s" if `v'=="*"
                    3.         destring `v', replace
                    4.         }
                  (2,842 real changes made)
                  v5: all characters numeric; replaced as long
                  ...
                  v14: all characters numeric; replaced as long
                  (112607 missing values generated)
                  
                  . describe
                  
                  Contains data
                    obs:       151,758                          
                   vars:            14                          
                  --------------------------------------------------------------------------------
                                storage   display    value
                  variable name   type    format     label      variable label
                  --------------------------------------------------------------------------------
                  v1              int     %8.0g                
                  v2              str41   %41s                  
                  v3              str41   %41s                  
                  v4              str6    %9s                  
                  v5              long    %10.0g                
                  v6              long    %10.0g                
                  v7              long    %10.0g                
                  v8              long    %10.0g                
                  v9              long    %10.0g                
                  v10             long    %10.0g                
                  v11             long    %10.0g                
                  v12             long    %10.0g                
                  v13             long    %10.0g                
                  v14             long    %10.0g                
                  --------------------------------------------------------------------------------
                  Sorted by:
                       Note: Dataset has changed since last saved.
                  
                  . list v5-v14 in 141710/141714, clean
                  
                            v5   v6   v7   v8   v9   v10   v11   v12   v13   v14  
                  141710.   .s    0    0    0    0     0     0     .    .s     0  
                  141711.   .s   .s   36    0    0     0    36    36    .s    .s  
                  141712.   14   14    6    0    0     0     0     .    20    20  
                  141713.    9    7    0    0    0     0    .s    .s     8     7  
                  141714.   .s   .s    5    0    0    .s     0    .s    .s    .s
                  One might argue that I could have used the numericcols() option on import delimited to force the storage of v5-v14 as numeric variables, replacing all anomalous content with missing values. But there are two arguments against that. First, the tabs I did are required to understand your data. Wholesale replacement of non-numeric data with missing values risks overlooking important problems in your data. And second, perhaps the asterisks are meaningfully different than the missing values that also appear in other variables on the same observation, and that meaning should be taken into account in the analysis. By replacing the asterisks with special missing values, I can distinguish them in subsequent analyses.

                  Added in Edit: The second example dataset imported perfect and had no missing or non-numeric values. This reinforces my believe that the problem lies within Eli's copy of Stata 14 and may be remedied by updating.
                  Last edited by William Lisowski; 21 Jan 2020, 11:40.

                  Comment


                  • #24
                    Originally posted by William Lisowski View Post
                    The first example dataset has 151,762 lines in total. Lines 1-4 are heading information, and nothing in them is suitable for use as variable names. Lines 5-151,758 contain 14 columns of comma-separate values. Columns 2, 3, and 4 contain text; the other columns apparently contain numbers. So my attempt to import the data follows.
                    Code:
                    . import delimited Example1.csv, rowrange(5) varnames(nonames)
                    (14 vars, 151,758 obs)
                    
                    . describe
                    
                    Contains data
                    obs: 151,758
                    vars: 14
                    --------------------------------------------------------------------------------
                    storage display value
                    variable name type format label variable label
                    --------------------------------------------------------------------------------
                    v1 int %8.0g
                    v2 str41 %41s
                    v3 str41 %41s
                    v4 str6 %9s
                    v5 str6 %9s
                    v6 str6 %9s
                    v7 str6 %9s
                    v8 str6 %9s
                    v9 str6 %9s
                    v10 str6 %9s
                    v11 str6 %9s
                    v12 str6 %9s
                    v13 str6 %9s
                    v14 str6 %9s
                    --------------------------------------------------------------------------------
                    Sorted by:
                    Note: Dataset has changed since last saved.
                    Now, the first thing I notice is that import delimited in my installation of Stata 16 did not reproduce the problem Eli reported. I hope that my advice in post #22 resolves his problem. I'll continue this exploration to its conclusion, however.


                    The next thing I notice is that Stata has decided that columns 5-15 contain non-numeric data and has imported them as string variables rather than numeric variables.

                    Here's a standard approach to figuring out what's wrong.
                    Code:
                    . foreach v of varlist v5-v14 {
                    2. tab `v' if missing(real(`v')), missing
                    3. }
                    
                    v5 | Freq. Percent Cum.
                    ------------+-----------------------------------
                    | 56,460 95.21 95.21
                    * | 2,842 4.79 100.00
                    ------------+-----------------------------------
                    Total | 59,302 100.00
                    
                    v6 | Freq. Percent Cum.
                    ------------+-----------------------------------
                    | 113,216 98.77 98.77
                    * | 1,406 1.23 100.00
                    ------------+-----------------------------------
                    Total | 114,622 100.00
                    
                    [remaining tabs omitted]
                    It turns out that 7,000+ lines in the CSV have a single asterisk in one or more of columns 5-14. Looking at the CSV in a text editor shows me where to look for examples.
                    Code:
                    . list v5-v14 in 141710/141714, clean
                    
                    v5 v6 v7 v8 v9 v10 v11 v12 v13 v14
                    141710. * 0 0 0 0 0 0 * 0
                    141711. * * 36 0 0 0 36 36 * *
                    141712. 14 14 6 0 0 0 0 20 20
                    141713. 9 7 0 0 0 0 * * 8 7
                    141714. * * 5 0 0 * 0 * * *
                    We see that there are both standard text missing values (e.g v12 in observation 141710) and asterisks (v6 and v13 in observation 141710). I don't know what to make of the asterisks, so I convert them to the text representation of one of Stata's special numeric missing values. I can then destring each of the variables and have a workable dataset that fully represents the input data, including the anomalous asterisks for me to deal with once I understand their meaning.
                    Code:
                    . foreach v of varlist v5-v14 {
                    2. replace `v' = ".s" if `v'=="*"
                    3. destring `v', replace
                    4. }
                    (2,842 real changes made)
                    v5: all characters numeric; replaced as long
                    ...
                    v14: all characters numeric; replaced as long
                    (112607 missing values generated)
                    
                    . describe
                    
                    Contains data
                    obs: 151,758
                    vars: 14
                    --------------------------------------------------------------------------------
                    storage display value
                    variable name type format label variable label
                    --------------------------------------------------------------------------------
                    v1 int %8.0g
                    v2 str41 %41s
                    v3 str41 %41s
                    v4 str6 %9s
                    v5 long %10.0g
                    v6 long %10.0g
                    v7 long %10.0g
                    v8 long %10.0g
                    v9 long %10.0g
                    v10 long %10.0g
                    v11 long %10.0g
                    v12 long %10.0g
                    v13 long %10.0g
                    v14 long %10.0g
                    --------------------------------------------------------------------------------
                    Sorted by:
                    Note: Dataset has changed since last saved.
                    
                    . list v5-v14 in 141710/141714, clean
                    
                    v5 v6 v7 v8 v9 v10 v11 v12 v13 v14
                    141710. .s 0 0 0 0 0 0 . .s 0
                    141711. .s .s 36 0 0 0 36 36 .s .s
                    141712. 14 14 6 0 0 0 0 . 20 20
                    141713. 9 7 0 0 0 0 .s .s 8 7
                    141714. .s .s 5 0 0 .s 0 .s .s .s
                    One might argue that I could have used the numericcols() option on import delimited to force the storage of v5-v14 as numeric variables, replacing all anomalous content with missing values. But there are two arguments against that. First, the tabs I did are required to understand your data. Wholesale replacement of non-numeric data with missing values risks overlooking important problems in your data. And second, perhaps the asterisks are meaningfully different than the missing values that also appear in other variables on the same observation, and that meaning should be taken into account in the analysis. By replacing the asterisks with special missing values, I can distinguish them in subsequent analyses.

                    Added in Edit: The second example dataset imported perfect and had no missing or non-numeric values. This reinforces my believe that the problem lies within Eli's copy of Stata 14 and may be remedied by updating.
                    Dear William,
                    As usual, thank you for taking the time to look into this situation.

                    I updated Stata, and the number of observations were no longer altered. I feel a bit bad with the back-and-forth for a simple fix, but I guess I learned an important lesson. Hopefully this will help others too.

                    More so, thank you for providing useful advice regarding special missing values, as opposed to wholesale replacement. This is something that I always wondered about--so thank you!

                    Comment

                    Working...
                    X