Dear forum,
I used the R package -pdftools- to turn PDF documentation files into datasets, and I am now using Stata to build a sort of "metadata" base. Here is what my working dataset looks like (after a bit of cleaning):
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input long(width height x y space) str126 text str27 font_name double font_size long(page obs_number)
89 13 56 572 0 "VARNAME" "BAAAAA+Arial-BoldMT" 12 73 22435
21 11 56 588 1 "This" "BAAAAA+Arial-BoldMT" 10 73 22436
17 11 81 588 1 "bit" "BAAAAA+Arial-BoldMT" 10 73 22437
54 11 101 588 1 "is" "BAAAAA+Arial-BoldMT" 10 73 22438
3 11 158 588 1 "a" "BAAAAA+Arial-BoldMT" 10 73 22439
18 11 164 588 1 "variable" "BAAAAA+Arial-BoldMT" 10 73 22440
39 11 186 588 0 "description" "BAAAAA+Arial-BoldMT" 10 73 22441
35 11 56 604 1 "The" "CAAAAA+ArialMT" 10 73 22442
11 11 95 604 1 "input" "CAAAAA+ArialMT" 10 73 22443
28 11 109 604 1 "variable" "CAAAAA+ArialMT" 10 73 22444
5 11 140 604 1 "=" "CAAAAA+ArialMT" 10 73 22445
42 11 148 604 0 "INPUT_VAR" "CAAAAA+ArialMT" 10 73 22446
23 11 56 620 1 "List" "BAAAAA+Arial-BoldMT" 10 73 22447
17 11 82 620 1 "of" "BAAAAA+Arial-BoldMT" 10 73 22448
28 11 102 620 0 "values" "BAAAAA+Arial-BoldMT" 10 73 22449
22 11 130 633 1 "Missing" "CAAAAA+ArialMT" 10 73 22450
21 11 155 633 1 "data" "CAAAAA+ArialMT" 10 73 22451
11 11 180 633 1 "or" "CAAAAA+ArialMT" 10 73 22452
27 11 194 633 1 "doesn't" "CAAAAA+ArialMT" 10 73 22453
49 11 224 633 0 "know" "CAAAAA+ArialMT" 10 73 22454
5 10 59 649 0 "1" "CAAAAA+ArialMT" 9 73 22455
15 11 130 648 0 "Yes" "CAAAAA+ArialMT" 10 73 22456
5 10 59 664 0 "2" "CAAAAA+ArialMT" 9 73 22457
18 11 130 664 0 "No" "CAAAAA+ArialMT" 10 73 22458
end
As you can see, each observation is one word (the output of -pdftools-), and I have information such as the font characteristics, the coordinates of the word, the page number, etc. What I want to do now is rebuild the sentences as they were in the PDF, and to do this I need to look at the font name and font size. I need code that does this:
"While there is no change in BOTH variables font_name and font_size,
append the content of text in observation i to the content of text in observation i-1 (separated by a space) and drop observation i."
The final dataset should look like this (the rest of the variables should be unchanged):
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str34 text
"VARNAME"
"This bit is a variable description"
"The input variable = INPUT_VAR"
"List of values"
"Missing data or doesn't know"
"1"
"Yes"
"2"
"No"
end
I'd appreciate any help!
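To make the quoted rule concrete, here is a minimal sketch of one way it might be written in Stata. It is only an illustration under some assumptions: the first example dataset above is in memory, obs_number gives the reading order of the words, and "the rest of the variables should be unchanged" means each merged row keeps the values of the first word in its run. The names run_id and merged are made up for this sketch, and strL requires Stata 13 or later.

```stata
* Assumes the first example dataset is in memory and obs_number gives reading order.
sort obs_number

* run_id goes up by 1 whenever font_name or font_size differs from the previous row.
* (run_id and merged are hypothetical names introduced for this sketch.)
generate long run_id = sum(_n == 1 | font_name != font_name[_n-1] | font_size != font_size[_n-1])

* Within each run, accumulate the words in order, separated by a space.
* strL avoids truncating long sentences at the width of the original str126.
bysort run_id (obs_number): generate strL merged = text if _n == 1
by run_id: replace merged = merged[_n-1] + " " + text if _n > 1

* Copy the full sentence to every row of the run, then keep only the first row,
* so the other variables hold the values of the run's first word.
by run_id: replace merged = merged[_N]
by run_id: keep if _n == 1

* Swap the merged sentence in for the original one-word text variable.
drop text run_id
rename merged text
order text, before(font_name)
```

Because this rule looks only at the fonts, two consecutive lines set in the same font and size would end up in one row. If that is not wanted, adding page or y to the change condition might be one way to split them.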
Valentine