Rich Pauloo's examples of where getSectionText() fails.
In findSectionHeader() - deal with combining the nodes that make up a section title.
Group them together and use the last of them as the starting point for the text up to the next section.
Identify watermark
light color
pdfText() should have docFont = FALSE by default.
getTextByCols() moves the ++ following C/C for C++ because it is at a different vertical position even though it is in the correct document order. So fix this.
get docInfo as meta data
[done] margins only gets left and right, not top and bottom.
Add dropHeadFoot parameter for margins.
Make margins smarter to first identify the header and footer and any other extraneous content (e.g. watermark, download information on the side of the page) and then compute the margins. Or identify the contiguous body of the page and focus only on that.
[done] margins() for multiple pages - return a data.frame or loose list of pairs.
margins(, asDataFrame = FALSE)
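A minimal sketch of how margins could cover all four sides and ignore header/footer material. bodyMargins() and its band argument are hypothetical and not part of the package; the input is assumed to be a data frame of bounding boxes with columns left, top, right, bottom, and the header/footer are assumed to sit in a fixed fraction of the page height.

```r
# Hypothetical helper, not part of ReadPDF: compute all four margins from a
# data frame of bounding boxes (columns left, top, right, bottom), optionally
# ignoring nodes that lie entirely in a top or bottom band of the page.
bodyMargins = function(bbox, pageHeight, dropHeadFoot = TRUE, band = 0.08)
{
    if(dropHeadFoot) {
        inHeader = bbox$bottom <= band * pageHeight
        inFooter = bbox$top >= (1 - band) * pageHeight
        bbox = bbox[!(inHeader | inFooter), , drop = FALSE]
    }
    c(left = min(bbox$left), right = max(bbox$right),
      top = min(bbox$top), bottom = max(bbox$bottom))
}

# Made-up page: one header node, two body nodes, one footer node.
bb = data.frame(left   = c(300, 80, 80, 500),
                top    = c(20, 120, 900, 1150),
                right  = c(520, 540, 540, 530),
                bottom = c(32, 132, 912, 1162))
bodyMargins(bb, pageHeight = 1200)
bodyMargins(bb, pageHeight = 1200, dropHeadFoot = FALSE)
```

A fixed band is the crudest possible choice; identifying the contiguous body of the page first would replace the two band tests.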
Enhance the isBold and isItalic functions to use the font info and the font name to guess when isBold etc. are FALSE.
Change default for asDataFrame in getBBox() and getBBox2() to TRUE and fix code in the package and in Zoonotics and other places we are using it. (Lots of places.)
length() generic not triggering on a "ConvertedPDFDoc"
getColPositions("LatestDocs/PDF/2682627390/Adrian Diaz-2008-West Nile virus in birds, Arg.xml")
getNumCols() gives 0 for Adrian ... above. Should be at least 2, but 3 is correct. Now get 2 columns but the 2nd page is quite different. The table throws it.
"NewPDFs/Seoul Virus/Zhang-2009-Hantaviruses in rodents and humans.xml"
Figure out what the coordinates are in an image.
In Easterbrook-2007, p4 The figure has lines and is PDF but we don't see the lines in the plot when we plot that page in R.
FIX fixTitleNodes().
Generally, fix the getDocTitle() function to be more accurate. See Status/TitleStatus.md
XXX pdfText messes up text in different columns. Check.
[check - think done] showNode() for lines|rect: location and color
getNodesBetween(): work with a line/rect node.
Footer is not pulling in all information, e.g. when multiple fonts occur, getPageFooter() only returns one font/bounding box baseline set. Add some "wiggle" room to match close enough bboxes ??Example doc in which problem occurs?
Currently bails out after a "successful" match. May want to return all possible places where there could be a match, but this might result in too many overall matches. Revisit if we have issues with the single match/ current approach.
TextRegex: actually MonthName Year match, does not match any year.
Check whether a page number in the footers/headers looks like a date, e.g. "1997.pdf"
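A small sketch of the two checks in the notes above: a "MonthName Year" pattern with an optional leading day, and a test for bare four-digit tokens that could be a year or a page number. Both patterns are assumptions for illustration and are not the package's TextRegEx.

```r
# Hypothetical patterns, not the package's own TextRegEx: match "Month Year"
# with an optional leading day, and flag bare 4-digit tokens that could be
# either a year or a page number.
months = paste(month.name, collapse = "|")
dateRx = sprintf("^([0-9]{1,2} )?(%s) (1[89]|20)[0-9]{2}$", months)
yearLikeRx = "^(1[89]|20)[0-9]{2}$"

x = c("July 1999", "26 April 1999", "1997", "Vol. 12")
isDate = grepl(dateRx, x)
isYearLike = grepl(yearLikeRx, x)
data.frame(text = x, isDate, isYearLike)
```

Tokens flagged by isYearLike are the ambiguous ones: they need context (position on the page, neighbouring pages) to decide between year and page number.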
Check title before abstract, etc. Title is most often correct.
Long dash is converted to A-, messing up some functions but not others.
Notes/Issues:
Title is a lock, abstract seems to be less reliable,
[high] Section headers on the same line but in different columns. 13 Geology-2013-Ballmer-G33804.1.pdf
Too many sections found:
22 Beghein Science-2014-Beghein-science.1246724.xml - bibliography entries
Extra section title in results
Spaces in words, e.g. D ISCUSSION means we don't find the section: 1-50/50 Colombi_et_al-2012-Geophysical_Journal_International.xml
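One possible way to cope with spaces inside words is to compare candidates to known section names after removing all white space. matchSectionName() and the list of known names are hypothetical, a sketch rather than what findSectionHeaders() does.

```r
# Hypothetical helper: compare a candidate title to known section names after
# removing all white space, so "D ISCUSSION" still matches "DISCUSSION".
knownSections = c("ABSTRACT", "INTRODUCTION", "MATERIALS AND METHODS",
                  "RESULTS", "DISCUSSION", "REFERENCES")

matchSectionName = function(txt, known = knownSections)
{
    squash = function(s) toupper(gsub("[[:space:]]+", "", s))
    # NA for text that matches none of the known names
    known[ match(squash(txt), squash(known)) ]
}

matchSectionName(c("D ISCUSSION", "M ATERIALS AND M ETHODS", "Table 1"))
```

Matching against a fixed list avoids gluing legitimate one-letter words (e.g. "A REVIEW") onto the next word, at the cost of missing unusual section names.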
Get the order of the sections correct - CIG_Citation/1-50/1 aagaard jgrb50217.xml The findSectionHeader() function should order the sections according to column, not document order. See Becker-2012
When there are no section headers, except REFERENCES, collect the text from the body into its own unnamed section. See Degiorgis-2000.
Combine text across lines. CGI_Citation/1-50/10 Bae_et_al-2010-Geophysical_Prospecting.xml - "Inversion Algorithm in the Laplace Domain". Also, 1-50/11 Bakir art%3A10.1007%2Fs00024-012-0482-8.xml
For CIG_Citation, add the appendices, etc. 12 Bakir art%3A10.1007....
Calzolari, Becker-2012 - Picking up an author name with the same font as the section headers we were looking for. And the last one is on a line by itself because it is in the second column, but all the other nodes on that line are in the first column. But the real problem is that these nodes span the width of the page. So
a) find the gap between the columns (end of one, start of next) and then see which lines do not have this gap between adjacent nodes (in nodeByLine) and then mark these as spanned. Or
b) find any line that has text with content in the gap between the columns.
So basically fix isOnLineBySelf() and pass it the getTextByCol()
Hack: Also remove any node that occurs "before" Abstract.
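A rough sketch of option b) above: treat a line as spanning the page when any node on it overlaps the gap between the columns. spansGap() is hypothetical; it assumes a data frame with one row per node (columns line, left, right) and that the gap is already known.

```r
# Hypothetical helper implementing option (b): a line is "spanned" when some
# node on it overlaps the gutter between the columns.
# nodes: data frame with columns line, left, right; gap: c(start, end) of gutter.
spansGap = function(nodes, gap)
{
    overlaps = nodes$left < gap[2] & nodes$right > gap[1]
    tapply(overlaps, nodes$line, any)
}

# Line 1 runs across the page; line 2 has one node per column; line 3 is a
# short line in the first column only.
nodes = data.frame(line  = c(1, 2, 2, 3),
                   left  = c(80, 80, 330, 80),
                   right = c(540, 300, 540, 250))
spanned = spansGap(nodes, gap = c(300, 320))
spanned
```

Lines flagged here could be excluded before deciding whether a node is on a line by itself within its column.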
[Fixed] Figuerola - Conclusions is coming up not on a line by itself but it is. Is it the table's columns throwing us off and going a little too far to the right in defining the start of the second column? We weren't getting the column positions from the page, but from the doc.
"LatestDocs/PDF/1727052847/Tong-2004-Ross River virus disease in Australi.xml" has keywords within the abstract "region"
[works] In Wernery, get the col positions correct. Currently returning -2. Is this a docFont = FALSE issue.
      TextRegEx       TextRegEx       TextRegEx      TextRegEx
    "July 1999" "26 April 1999" "21 April 1999" "20 July 1999"
5 minutes to process 71 documents.
tt = readRDS("SP_SectionText.rds")
names(tt) = gsub("^../", "", names(tt))
len = sapply(tt, length)
b = tt[len >= 10]
system.time({tmp = lapply(names(b), findSectionHeaders)})
names(tmp) = names(b)
isNumbered = sapply(tmp, function(x) all(grepl("^[0-9]+(\\.[0-9]+)?", sapply(x, xmlValue))))
tmp = tmp[!isNumbered]
b = b[!isNumbered]
order(sapply(tmp, length))
     array       list       NULL  try-error XMLNodeSet
        31        257         84          5         31
The errors
[1] "LatestDocs/PDF/2828631744/art%253A10.1023%252FA%253A1008199800011.xml"
[2] "LatestDocs/PDF/3529243761/Pattnaik-2006-Kyasanur Forest disease_ an epid.xml"
[3] "LatestDocs/PDF/2999137579/Wong et al 2007 supplement.xml"
[4] "LatestDocs/PDF/0148058638/Wong_et_al-2007-Reviews_in_Medical_Virology.xml"
[5] "LatestDocs/PDF/3246714993/bok%253A978-3-540-70962-6.xml"
is.scanned = sapply(sp.xml[sapply(sections, is.null)], function(x) try(isScanned(x)))
table(is.scanned)
The documents for which we got NULL for the sections and which are NOT scanned:
sp.xml[sapply(sections, is.null)][!is.scanned]
 [1] "LatestDocs/PDF/1601876396/OIE Iran.xml"
 [2] "LatestDocs/PDF/2430316441/OIE Kuwait.xml"
 [3] "LatestDocs/PDF/3814962940/OIE Oman.xml"
 [4] "LatestDocs/PDF/3982771992/Leroy-2004-Multiple Ebola virus transmission e.xml"
 [5] "LatestDocs/PDF/2364497871/leroy et al 2005.xml"
 [6] "LatestDocs/PDF/1217382941/Barrette-2009-Discovery of swine as a host fo1.xml"
 [7] "LatestDocs/PDF/4154443567/Barrette-2009-Discovery of swine as a host for.xml"
 [8] "LatestDocs/PDF/3267708254/Quaglia-2014-West Nile and st. Louis encephali.xml"
 [9] "LatestDocs/PDF/0818313444/vir.0.81576-0-SuppTableEdited.xml"
[10] "LatestDocs/PDF/3342055963/08-0359_appT-s1 (2).xml"
[11] "LatestDocs/PDF/1502738312/Lundkvist-1998-Human Dobrava hantavirus infect.xml"
[12] "LatestDocs/PDF/0613064798/Plyusnin-1999-Dobrava hantavirus in Russia.xml"
For NULL values returned, indicate no sections, perhaps list().
Get keywords as a section.
[fixed] Leroy-2004 - finding "Materials and Methods" under the "Supporting Online Material". For the same paper, we run into this with the "Table S1" - see below.
Brauburger-2012 - get header content page number and year (both look like years)
Develop getPageHeader/getHeader and footer versions.
Venter-2010 - findSectionHeaders() includes the header for the pages "VENTER AND SWANEPOEL" and "WNV LINEAGE 2 PATHOGENESIS"
Blasdell - table 1 is a great example of containing all the data we want.
Also Linke table 3 another example of where the data are that we want.
Klein - gets getColPosition() wrong for perPage = TRUE or FALSE. Hence isOnLineBySelf() fails. And we need that for determining if the section titles are on their own line.
LatestDocs/PDF/0212899111/Levis-2004-Hantavirus pulmonary syndrome in no.xml Matching CA). Check on line by itself.
Matching too many - References??? But more than that.
LatestDocs/PDF/1609915988/McIntosh-1976-Culex (Eumelanomyia) rubinotus T.xml
[This may be correct as it is a 35 page doc with a table of contents] LatestDocs/PDF/0817727758/Klein-2011-Hantaan virus surveillance targetin.xml
Fix isScanned - LatestDocs/PDF/1609915988/McIntosh-1976-Culex (Eumelanomyia) rubinotus T.xml But isScanned() and isScanned2() say no! Hjelle-1995 also scanned. Rudnick also: LatestDocs/PDF/3257936385/Rudnick-1965-Studies of the ecology of Dengue.xml
Getting the author names LatestDocs/PDF/0368782170/Chew-2000-Risk factors for Nipah virus infecti.xml LatestDocs/PDF/0382058825/Rihtaric-2010-Identification of SARS-like Coro.xml
Combine the text on the same line.
Names of the sections have extra spaces within word LatestDocs/PDF/0851236576/Chevalier-2010-Environmental risk factors of W.xml
OKAY - Numbered sections
[check] When combining nodes on a line in, e.g., Forrester-2008, get nodesByLine() correct. The b/< characters have @top=149 & @height=12 and the number have @top=151 & @height=10 We may want to group by @top + @height.
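A minimal sketch of grouping by @top + @height, using the Forrester-2008 numbers from the note above. groupByBaseline() and its tolerance are hypothetical and only illustrate the idea; nodesByLine() may do this differently.

```r
# Hypothetical sketch: group text segments into lines by their baseline
# (top + height) rather than by top alone, allowing a small tolerance.
groupByBaseline = function(top, height, tol = 2)
{
    base = top + height
    o = order(base)
    # start a new group whenever the sorted baseline jumps by more than tol
    grp = cumsum(c(TRUE, diff(base[o]) > tol))
    out = integer(length(base))
    out[o] = grp
    out
}

# top = 149/height = 12 and top = 151/height = 10 share the baseline 161.
top    = c(149, 151, 149, 170)
height = c(12, 10, 12, 12)
groupByBaseline(top, height)
```

Superscripts have a higher baseline than the surrounding text, so they would likely still need a separate rule.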
useBase = TRUE in nodesByLine(). How does it perform with superscripts. See Alagaili...2014 and Table 1's column headers.
Read the tables back to data frames, arranging each line into columns, but determining the columns across all lines first.
Read the footnotes. Make sense of them!!
Determine the caption, e.g., above the first line
Remove any footer line that spans the entire page on all pages before looking for tables.
Recognize Table XX in the text as not a table identifier. See Leroy-2004 - "Table S1" at the very end of the article that refers to supporting online material. We'll just end up with 0 rows for the table and can discard.
identify tables and put the related nodes into a table node and then potentially write the result back to the original file so we have that information for subsequent reads of that document.
Look for lines separating rows in tables.
Schmaljohn-1997 - 2 complex tables, one of which spans 1 1/4 columns. getTextByCols() is returning 4 elements, but getColPositions() gives just 2. Almost works out of the box, but doesn't include the right-most column. For a table that partially goes into another column, check the line endings of the text within the vertical region and see if there is a big enough gap/margin.
Table 1 is on a page all by itself. Its contents are not in the doc font - not a single text element with the doc font on that page. So getTextByCols() and getColPositions() fail to return anything. We need getColPositions(, docFont = FALSE). ADDED NOW
Brauburger-2012 - single column long paper. Tables continued across pages. Kariwa-2007 also continued.
Neel-2010 - a rotated table.
For Neel, page 5: all the text is rotated 90 except 5 nodes which are the header for that page. Can we detect this and then change the bbox to treat x0 as y0 and x1 as y1 and reorder the dimensions of the page.
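A rough sketch of the detection and swap described above. unrotatePage() is hypothetical, and the rotation column on the bounding-box data frame is an assumed input rather than a documented attribute. It performs only the plain x/y swap from the note; depending on the direction of rotation, one axis may also need to be mirrored.

```r
# Hypothetical sketch: if most text nodes on a page are rotated by 90 degrees,
# swap the x and y coordinates of the bounding boxes and the page dimensions.
# The 'rotation' column is an assumed input, not a documented attribute.
unrotatePage = function(bbox, pageDim, threshold = 0.5)
{
    if(mean(bbox$rotation == 90) <= threshold)
        return(list(bbox = bbox, pageDim = pageDim))

    swapped = data.frame(x0 = bbox$y0, y0 = bbox$x0,
                         x1 = bbox$y1, y1 = bbox$x1,
                         rotation = bbox$rotation)
    list(bbox = swapped,
         pageDim = c(width = pageDim[["height"]], height = pageDim[["width"]]))
}

# Made-up page: three rotated table nodes and one unrotated header node.
bbox = data.frame(x0 = c(100, 120, 140, 50),
                  y0 = c(200, 200, 200, 30),
                  x1 = c(112, 132, 152, 400),
                  y1 = c(900, 850, 700, 42),
                  rotation = c(90, 90, 90, 0))
pg = unrotatePage(bbox, pageDim = c(width = 800, height = 1200))
pg$pageDim
```

The unrotated header nodes are swapped along with everything else here; they would probably be better dropped before the swap.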
[!!] "1351986620/J Infect Dis.-2015-Ogawa-infdis-jiv063.xml" - tables with rows with alternating colors.
[used to work, I think] Table 2. Previously: gets the header and first row, but not the remaining rows. The rows have alternating colors. Can we exploit this to identify the rows? Now gets more than we want: includes a line from the other column from Figure 3 and much of the caption, then text from the 2nd column below the table and the "Downloaded from " text, which is rotated.
[works] Table 1 (which comes second)
[table 2 broken] Armien-2004 - good example of table
names(findTable(getTables(ar)[[2]]))
tt = getTables(nit); findTable(tt[[1]])
[works] Armien-2004 - good example of table
[works]
[no caveats now after adjusting getTextByCols(), etc. to compute getColPositions() across entire document.]
Table 2. Table 2 is not considered centered. getColPositions() returns only 1 column. This is because
the references are in the second column and are numbered and indented, so very few text nodes start at the column's left edge.
If we specify the column breaks ourselves, based on page 4, it works:
tt = getTables(ar)
names(findTable(tt[[2]], colNodes = getTextByCols(ar[[5]], breaks = c(79, 474), asNodes = TRUE)))
names(findTable(tt[[2]]))
[works] Table 1
[works] Can't detect Klein-2011 - lines don't span all the way across the page. But no text to the right.
But many additional lines.
names(findTable(getNodeSet(k, "//text[contains(., 'Table 1')]")[[1]]))
[works] Weaver-2001 - table 2 - getColPositions() has 5 columns because the table dominates. [this part fixed now.] getColPositions() uses the id of the most common font (getDocFont()) to find the relevant text, and so excludes the tables, etc.
[works] Padula-2002 - 1 table spans 2 columns
[Works] table 3 in Fulhorst - spans width of page.
Thinks there is only one column. So getColPositions() needs work because of the image in the
second column.
tt = getTables(fu)
names(findTable(tt[[3]]))
[works] 3 columns: 3982771992/Leroy-2004-Multiple Ebola virus transmission e.xml
[works] NipahAsia
[fixed with perPage = FALSE] getColPositions() - see Armien-2004 p5.
Section title: Look for text on a line on its own, a little separated from next line and not
taking up the entire column width.
findShortLines(getTextByCols(wv[[2]], asNodes = TRUE)[[1]])
Then we see the lines that don't span the entire column and also the ones that start with an
indentation.
See also findShortSectionHeaders()
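A minimal sketch of the short-line idea, under the assumption that each line of a column has already been reduced to its left and right extent. classifyLines() and its thresholds are hypothetical; the package's findShortLines() may work differently.

```r
# Hypothetical sketch of the idea (the package's findShortLines() may differ):
# flag lines that end well before the right edge of the column, and lines that
# start to the right of the column's left edge.
# lineBox: data frame with one row per line and columns left, right.
classifyLines = function(lineBox, frac = 0.8, indentTol = 5)
{
    colLeft = min(lineBox$left)
    colRight = max(lineBox$right)
    width = colRight - colLeft
    data.frame(short = (lineBox$right - colLeft) < frac * width,
               indented = (lineBox$left - colLeft) > indentTol)
}

# Full line, indented first line of a paragraph, short line, full line.
lineBox = data.frame(left = c(80, 95, 80, 80), right = c(300, 300, 180, 299))
classifyLines(lineBox)
```

A short, non-indented line followed by a larger than usual vertical gap is the section-title candidate; the last line of a paragraph is short too, so the gap test matters.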
Compute document-wide interlineskip.
Get all the @top from the text nodes on a page.
Group them by line
order the lines
compute difference
ptops = as.numeric(unlist(getNodeSet(wv[[2]], "//text/@top")))
pcut = split(ptops, cut(ptops, seq(0, 1200,by=13)))
pcut = pcut[ sapply(pcut, length) > 0]
diff(sapply(pcut, min))
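A possible alternative to the fixed-width bins above, sketched on a plain numeric vector of @top values so it runs without a document. interlineSkip() and the tolerance are hypothetical; the made-up tops include a paragraph break to show why the most common gap is used rather than the mean.

```r
# Hypothetical alternative to fixed-width bins: cluster the sorted @top values
# into lines with a small tolerance, then take the most common gap between
# consecutive lines as the interline skip.
interlineSkip = function(tops, tol = 2)
{
    tops = sort(unique(tops))
    lineId = cumsum(c(TRUE, diff(tops) > tol))
    lineTop = tapply(tops, lineId, min)
    gaps = diff(as.vector(lineTop))
    tt = table(gaps)
    as.numeric(names(tt)[which.max(tt)])
}

# Lines 13 units apart, tops on the same line differing by 1, and one
# paragraph break of 26 units.
tops = c(100, 101, 113, 113, 126, 127, 139, 165, 178)
interlineSkip(tops)
```

For a document-wide value, the same function could be applied to each page and the results tabulated.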
identify abstract and put it in its own node.
Find text within shaded region. Put the text nodes in a node.
Remove header and footer material from getTextByCols()
[check] Find superscripts that are citations and remove them from the text. See findBibCites()
[check] Group segments that have very close tops together. Implemented in nodesByLine(). See isCentered() where we combine segments into lines. Move this code out to a separate function.
So getColPositions() needs work because of the image in the second column. See Fulhorst-2002 page 4.
Similarly, can add 1-column, 2-column, etc around the text, which column and where the columns start and end.
In getNodesBetween(), we should arrange the text by line and within line from left to right. getTextByCols() should do this. We do this for isCentered(). Need to deal with the top values being one or two units apart for segments on the same line.
Identify section starts and ends, i.e. section titles.
Got some extras and missed DISCUSSION in 3234834982/Fulhorst-2002-Natural host relationships and 1.xml Also the s of Merriam's is running into pocket mice. Two italic segments on that line. The problem is that isCentered() is failing. The top for this text is 490. There is another set of tops at 489 which are the italic parts. So we need to group these properly.
Nothing in 2688324473/Beltrame-2006-Tickborne encephalitis virus, no.xml but makes sense - letter to editor. Same with 4021195741/Shepherd-1987-Antibody to Crimean-Congo hemorr.xml
LatestDocs/PDF/2081396765/Neel-2010-Molecular epidemiology of simian imm.xml gets too much, some from the abstract which spans the two columns but is indented.
LatestDocs/PDF/3385699523/Holzmann-2010-Impact of yellow fever outbreaks.xml
Gets some extra parts, e.g. 2010 Wiley-Liss, Is the
on page 2 on the line with a paragraph indentation - "An outbreak has been
defined" col 1, halfway down.
3136760279/Tauro-2012-Serological detection of St. Louis.xml is correct, but there are also paragraph titles that are interesting/useful, e.g. Study site, Sample collection, which are in italics and followed by a - at the start of a paragraph.
3982771992/Leroy-2004-Multiple Ebola virus transmission e.xml - nothing and this is correct.
Good: 2956441632/Cui-2008-Detection of Japanese encephalitis vi.xml
Numbered sections: 1347402211/Luis et al_2014_A comparison of bats and roden.xml Also has valuable sub-section titles.
Numbered: 3512447895/Hara-2005-Isolation and characterization of a1.xml
Look for text at the start of a paragraph that starts with italics or a different font. See Aguilar-2007
Make isCentered() faster.
Not picking up sub-section titles, intentionally. See 3133228518/Murphy-2006-Implications of simian retroviruse.xml for example.
?Include unnumbered sections in documents with numbered section headers, e.g. Lahm and Acknowledgements, References. Do we care?
[manually check] For Weaver & Lahm, finish getting the text for sections.
When finding section headers, check if the templates we find are centered and check that others that have the same font are also centered. See 3618741902/Armien-2004-High seroprevalence of hantavirus.xml Weaver and Klein also have centered sections.
[check works] in getColPositions() if values are too close together drop the right one. Weaver page 4. 470 and 471. To do with indented first line of paragraph. Increase threshold. But getTextByCols() has no nodes in the 470 one.
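A minimal sketch of dropping positions that are too close together, using the Weaver values from the note. dropClosePositions() and its default threshold are hypothetical and only show the rule, not how getColPositions() implements it.

```r
# Hypothetical helper: collapse candidate column positions that are closer
# than 'threshold' to the previous kept position, keeping the left-most one.
dropClosePositions = function(pos, threshold = 5)
{
    pos = sort(pos)
    keep = logical(length(pos))
    last = -Inf
    for(i in seq_along(pos)) {
        if(pos[i] - last >= threshold) {
            keep[i] = TRUE
            last = pos[i]
        }
    }
    pos[keep]
}

dropClosePositions(c(79, 470, 471))
```

This always keeps the left-most of a close pair. Given that getTextByCols() finds no nodes at 470, keeping the position with the most nodes might be the better rule.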
When getting nodes in getTextAfter, recognize tables at the top or bottom of the page and skip
over them. Weaver p6
h = findSectionHeaders("LatestDocs/PDF/0629421650/Padula-2002-Andes virus and first case report1.xml")
sapply(getTextAfter(h[[10]], h[[11]]), xmlValue)
Find tables and figures
Implement getHeader and footer. See Lahm-2007 with lines at the top of the page.
Find abstract and if it spans the entire page, don't include it when computing columns.
For 2 or more columns, detect the part which is only one column spanning the entire page.
For getColPositions() take the entire document into account and take the most common. Give the parts after References/Bibliography less weight. These are often indented due to the number so we don't get much text starting at that point See 3618741902/Armien-2004-High seroprevalence of hantavirus.xml
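A rough sketch of pooling left edges across the whole document and down-weighting the pages from the references onward. docColPositions(), its weights and its cut-off are hypothetical; the numbers are made up, with the indented, numbered references starting at 490 instead of 474.

```r
# Hypothetical sketch: pool the left edges of text nodes from all pages, give
# pages after the references less weight, and keep the positions that carry at
# least 'minShare' of the total weight.
docColPositions = function(left, page, refPage = Inf, refWeight = 0.25, minShare = 0.2)
{
    w = ifelse(page >= refPage, refWeight, 1)
    share = tapply(w, left, sum) / sum(w)
    as.numeric(names(share)[share >= minShare])
}

# Body text starts at 79 and 474; indented first lines at 95; the references
# on page 5 start at 490.
left = c(rep(79, 40), rep(474, 35), rep(95, 10), rep(490, 15))
page = c(rep(1:4, each = 10), rep(1:5, each = 7), rep(1:2, each = 5), rep(5, 15))
docColPositions(left, page, refPage = 5)
```

Exact equality of left edges is assumed here; real positions would probably need the same close-value grouping used elsewhere in these notes.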
exclude shaded boxes when computing column positions. And images. and tables. See Lahm-2007
The box on the left side of the page doesn't appear to be as wide as in the PDF. This is the keyword box. "Zoonotics/...PDF/0809541268/Kitajima-2009-First detection of genotype 3 he.xml" This comes up in splitElsevierTitle() and is why we do not filter the nodes if none has y > yl.
Get all of the elements in the title even if changed font i.e. identify title and then find all the elements near these that make up the lines. Have to deal with spanning 2 columns and may not be part of the title and many other issues. e.g. 1834853125/394.full.xml
getColPositions: when first line of paragraph is indented, we don't get the critical mass at the same point. See 0337534517/Andriamandimby-2011-Crimean-Congo%20hemorrhagic.xml
Reassemble the elements of a word, line, paragraph from the different elements See nodesByLine()
Detect 2 columns when one is mostly a figure and not words. Figure out columns for all pages and correct if one or two pages seems to be single column.
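A minimal sketch of correcting per-page column counts against the rest of the document. reconcileNumCols() and maxOutliers are hypothetical; the input is assumed to be one column count per page, however it was obtained.

```r
# Hypothetical sketch: take the most common per-page column count as the
# document's layout and overwrite pages that disagree, provided the
# dissenting pages are a small minority.
reconcileNumCols = function(perPage, maxOutliers = 2)
{
    tt = table(perPage)
    docCols = as.integer(names(tt)[which.max(tt)])
    outliers = perPage != docCols
    if(sum(outliers) <= maxOutliers)
        perPage[outliers] = docCols
    perPage
}

# Pages 3 and 6 look single-column, e.g. because a figure fills one column.
reconcileNumCols(c(2L, 2L, 1L, 2L, 2L, 1L))
```

If more pages disagree than maxOutliers allows, the counts are returned unchanged, since the document may genuinely mix layouts.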
[done] Rationalize getFontInfo() and fontInfo() functions. fontInfo() gone. getFontInfo() now returns the full data frame and uses the font id as row names.
[done] Error from isScanned2("LatestDocs/PDF/2143276081/Kamhieh-2006-Borna disease virus (BDV) infect1.xml")
[fixed] getDatePublished() for Aguilar-2007 gives NULL but info at the end - April 8, 2006 The version that was in Zoonotics-shared and now in ReadPDF works fine.
[Done] Find font for the majority of the text.