I'd like to throw out a topic for discussion that perhaps doesn't have an exact answer (or perhaps it's quite elementary).
I am in the progress of, and also stuck in, a data appending task. However, the datasets were from different sources and many of them have apparently been processed by STATA, and therefore stored in various 'storage types'.
For example, dataset A and B have the following two variables but with different storage types:
Dataset A
Dataset B
I must admit that dealing with date data is a real pain. Especially as some data is defined in milliseconds (e.g., stored as 715305600000 and displayed as 03jan1993 00:00:00) and some in days (e.g., stored as 15352 and displayed as 12jan2002).
Certainly, I can use many commands/functions such as recast, format, floor() to manipulate the storage type and/or display format in order to appropriately append B to A. I found that when appending data, STATA is particularly strict about the formatting of the data, and it is easy for a type mismatch to result in a null value.
The problem is that both datasets have more than 500 variables. I may be able to check and reformat 5-8 important variables in order to achieve the correct data append, while it would be unrealistic to check all variables.
Questions:
I have an idea that I don't know if it's right (it might sound a bit speculative). As both datasets are complete in terms of the content (i.e. no corrections need to be made). Can I export the datasets to .csv format, then re-import them back to STATA? So, all variables will be converted to string format, and all datasets can be appended smoothly. Certainly, I need to convert the date and other numeric variables back to the sensible formats (e.g., int, float...).
In the original datasets, most variables were indeed numeric (e.g., 0, 1, 2...) at the backend, but coded using value labels (e.g., 0: Sydney; 1: Canberra...). In the above scenarios regardless whether using original datasets (with value labels), or re-import from csv (no value labels), if I tabulate the variable (e.g., tab servicelocation), it shows:
It seems that even when certain categorical variables are stored in string format (red color), they do not seem to affect the analysis.
So, my final question is: regardless of the date and numeric variables, does it cause any problems if the background values of other variables are erased and only the categories defined by "value labels" are retained?
Thanks.
I am in the progress of, and also stuck in, a data appending task. However, the datasets were from different sources and many of them have apparently been processed by STATA, and therefore stored in various 'storage types'.
For example, dataset A and B have the following two variables but with different storage types:
Dataset A
Variable name | Storage type | Display format |
servicelocation | byte | %33.0g |
dob | double | %td.. |
Variable name | Storage type | Display format |
servicelocation | str18 | %18s |
dob | float | %tc |
Certainly, I can use many commands/functions such as recast, format, floor() to manipulate the storage type and/or display format in order to appropriately append B to A. I found that when appending data, STATA is particularly strict about the formatting of the data, and it is easy for a type mismatch to result in a null value.
The problem is that both datasets have more than 500 variables. I may be able to check and reformat 5-8 important variables in order to achieve the correct data append, while it would be unrealistic to check all variables.
Questions:
I have an idea that I don't know if it's right (it might sound a bit speculative). As both datasets are complete in terms of the content (i.e. no corrections need to be made). Can I export the datasets to .csv format, then re-import them back to STATA? So, all variables will be converted to string format, and all datasets can be appended smoothly. Certainly, I need to convert the date and other numeric variables back to the sensible formats (e.g., int, float...).
In the original datasets, most variables were indeed numeric (e.g., 0, 1, 2...) at the backend, but coded using value labels (e.g., 0: Sydney; 1: Canberra...). In the above scenarios regardless whether using original datasets (with value labels), or re-import from csv (no value labels), if I tabulate the variable (e.g., tab servicelocation), it shows:
Service Location | Freq. | Precent | Cum. |
Sydney | 1,321 | 34.49 | 34.49 |
Canberra | 377 | 6.61 | 41.10 |
... | ... | ... | ... |
So, my final question is: regardless of the date and numeric variables, does it cause any problems if the background values of other variables are erased and only the categories defined by "value labels" are retained?
Thanks.
Comment