Why Jungle Rhythms’ pages are yellow

One of the main reasons to digitize the large summary tables used within the Jungle Rhythms project was to save a digital copy of the physical original. The original, stored in the Yangambi herbarium, was in a dire state, and preservation was a primary concern. Certain papers, when not stored properly, tend to yellow and slowly become brittle and decay as time passes.

Newsprint and some other papers made from ground wood or straw contain more non-cellulose compounds (lignin in particular) than European paper (nearly pure cellulose) or Asian-style paper made from grass fibers. These non-cellulose compounds, and lignin especially, are subject to oxidation, which alters their molecular structure, turning them into, among other things, phenolic acids and shifting the colors we perceive towards yellow and brown.

These phenolic acids not only make the paper yellow, they also make it brittle. These days paper either has all lignin removed, or contains an alkaline substance (calcium bicarbonate) to neutralize any lignin that is left; this kind of paper is called acid-free paper.

Luckily, the paper used for the summary tables was thick enough to keep it from falling apart due to oxidation. Even so, it was not safe from water damage or the occasional rodent looking for nesting material. The current state of the paper is therefore a mix of internal paper chemistry, adsorption of dust and dirt, water damage, and mechanical damage and stress (rodents, handling, …).

(sections copy edited from the Smithsonian Institution Archives. For more information see this research abstract and the series of articles.)

 

Warm winter weather confuses plants

Warm December weather dominates both the northeastern US and large parts of Europe. On both sides of the ocean this warm weather leaves plants equally confused.

Many plants are blooming in New York’s botanical gardens and in Boston parks. Most perennial plants are unlikely to suffer irreparable damage. However, Belgian fruit farmers fear that an untimely frost on unhardened fruit trees might have serious consequences for both tree vigour and fruit yields (see the movie below).

Plants can withstand frost; however, the continuously warm weather in Belgium has left many trees unacclimated to true winter conditions. A sudden return to normal freezing winter conditions could cause frost damage to tissues otherwise protected by a tree’s natural anti-freeze, e.g. sugars in living tissue.

With no real frost days in Belgium, the effect of this mild winter might even extend into next spring. Many trees need a certain number of chilling days, days which are sufficiently cold or freezing, to trigger a proper leaf-out response the following spring. High spring temperatures should move leaf development towards earlier start dates; however, a lack of chilling days has been shown to delay this expected response to warmer spring temperatures. Warm winter temperatures therefore not only pose an immediate risk, through sudden freezing of tissue, but also have delayed consequences which extend into the next growing season.
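
As a rough illustration of this chilling requirement, the sketch below simply counts the days in a record of daily minimum temperatures that fall within a chilling window. The 0 to 5 °C window and the 60-day requirement mentioned in the comments are arbitrary placeholder values, not parameters taken from any particular phenology model.

# illustrative sketch only: count the chilling days in a vector of daily
# minimum temperatures; the 0 - 5 C window and the 60 day requirement
# mentioned below are placeholder values, not calibrated model parameters
chilling.days <- function(t_min, lower = 0, upper = 5){
  sum(t_min >= lower & t_min <= upper)
}

# a mild winter vs. a colder one (90 days of simulated daily minima)
mild = rnorm(90, mean = 8, sd = 3)
cold = rnorm(90, mean = 3, sd = 3)

chilling.days(mild) # likely far short of a 60 day requirement
chilling.days(cold) # likely close to, or above, such a requirement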

(Header image: early leafing Sambucus nigra)

PhenoCam grassland paper accepted

I haven’t made much noise about this, but the grassland study I submitted to Nature Climate Change was accepted. I’m working on the final edits to resubmit after Christmas, and I’m rather happy with this outcome. For more details I suggest keeping an eye on the Nature Climate Change portal; I will discuss the work in detail once the press embargo lifts.

With the upcoming data paper, which presents an extensive dataset of curated PhenoCam data, there might be enough data to do more grassland-related work: many grassland sites were added where I was rather data-limited before.

Scraping Zooniverse statistics

To keep track of the Jungle Rhythms project I wanted some basic summary statistics, like those shown on the front page of the project. However, the project’s front-end API does not allow these basic statistics to be pulled from a database. Furthermore, fetching all the project data can only be done once a day (to prevent heavy traffic on the database), which keeps me from generating these statistics myself. Still, I want to keep track of how classifications and users change over time.

So, I wrote a web scraper in R which I run every half hour. Because the project’s front page is dynamic, the scraper first renders it using PhantomJS; it then pushes the resulting HTML through the rvest R package to extract all necessary (time-stamped) elements and writes everything to file, appending to the file if it already exists. You can find the code (an R function) below, followed by a short usage example.

#' Grab basic zooniverse statistics from the front page of a project
#' @param url: Location of zooniverse project
#' @param file: the name of the output file to export statistics to
#' @param path: location of the phantomjs binary (system specific)
#' @keywords zooniverse, statistics, web scraping
#' @export
#' @examples
#' # with defaults, returns a file called user.stats.csv
#' # for the Jungle Rhythms project
#' zooniverse.info()
#' [requires the rvest package for post-processing]
#' [http://phantomjs.org/download.html]
#' 

zooniverse.info <- function(url="http://www.zooniverse.org/projects/khufkens/jungle-rhythms/home",
                                  file="user.stats.csv",
                                  path="~/your.phantom.js.location/"){
  
  # read the required libraries
  require(rvest)
  
  # grab current date and time (a time stamp)
  date = format(Sys.Date(),"%Y-%m-%d") 
  time = format(Sys.time(),"%H:%M")
    
  # write out a script phantomjs can process
  # change timeout if the page bounces, seems empty !!!
  writeLines(sprintf("var page = require('webpage').create();
                     page.open('%s', function (status) {
                     if (status !== 'success') {
                     console.log('Unable to load the address!');
                     phantom.exit();
                     } else {
                     window.setTimeout(function () {
                     console.log(page.content);
                     phantom.exit();
                     }, 3000); // Change timeout to render page
                     }
                     });", url), con="scrape.js")

  # process the script with phantomjs / scrapes zooniverse page
  system(sprintf("%s/./phantomjs scrape.js > scrape.html",path),wait=TRUE)
  
  # load the retrieved rendered javascript page
  main = read_html("scrape.html")
  
  # set html element selector (which html fields to retrieve)
  sel = '.project-metadata-stat div'
  
  # process the html file using selection and render as text
  data = html_nodes(main,sel) %>% html_text()
  
  # if data is retrieved, append it to the output file
  # if this fails, you most likely need more time to render
  # the page (see timeout above)
  if (!identical(data, character(0))){
    
    # kick out description fields and convert to numeric
    data = as.numeric(data[-c(2,4,6,8)]) 
    
    # merge into dataframe
    data = data.frame(date, time, t(data))
    colnames(data) = c('date','time','registered_users',
                       'classifications','subjects','retired_subjects')
    
    # append stats with the current date and time
    # to the output file (set by the file parameter)
    if (file.exists(file)){
      write.table(data,file,quote=F,row.names=F,col.names=F,append=T)
    }else{
      write.table(data,file,quote=F,row.names=F,col.names=T)
    }
  }
  
  # remove html file and javascript
  file.remove("scrape.html")
  file.remove("scrape.js")
}
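
As referenced above, a minimal usage sketch of the function. The PhantomJS path is a placeholder for wherever the binary lives on your system, and the plot is just one quick way of inspecting the accumulated records.

# call with a placeholder phantomjs path (adjust to your system)
zooniverse.info(path = "~/bin")

# after a few scheduled runs (I use a half-hourly cron job), read the
# accumulated statistics back in and plot classifications over time
stats = read.table("user.stats.csv", header = TRUE)
plot(stats$classifications, type = "l",
     xlab = "scrape", ylab = "classifications")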

 

Scraping Ameriflux site info

On the flight home from AGU 2015 I realized that the same code that I used to scrape Zooniverse statistics could easily be changed to grab the site summary data from the Ameriflux LBL page. As with the Zooniverse code, it relies on external PhantomJS binaries.

The function returns a data frame with all scraped data (site names, lat/long, elevation, etc.). Errors in the table, mainly in the start and end dates, are due to errors in the original data, not to the conversion.

I’ll use this function in combination with my Ameriflux download tool to provide easier sub-setting of the data; keep an eye on my blog for upcoming updates to that tool. A short usage example follows the function below.

#' Grabs the ameriflux site table from the LBL site
#' @param url: Location of the Ameriflux site table
#' @param path: location of the phantomjs binary (system specific)
#' @keywords Ameriflux, sites, locations, web scraping
#' @export
#' @examples
#' # with defaults, outputting a data frame
#' df <- ameriflux.info()
#' [requires the rvest package for post-processing]
#' http://phantomjs.org/download.html

ameriflux.info <- function(url="http://ameriflux.lbl.gov/sites/site-list-and-pages/",
                           path="~/my.phantom.js.path/"){
  
  # read the required libraries
  require(rvest)
  
  # subroutines for trimming leading spaces
  # and converting factors to numeric
  trim.leading <- function (x)  sub("^\\s+", "", x)
  as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
  
  # write out a script phantomjs can process
  # change timeout if the page bounces, seems empty !!!
  writeLines(sprintf("var page = require('webpage').create();
                     page.open('%s', function (status) {
                     if (status !== 'success') {
                     console.log('Unable to load the address!');
                     phantom.exit();
                     } else {
                     window.setTimeout(function () {
                     console.log(page.content);
                     phantom.exit();
                     }, 3000); // Change timeout to render the page
                     }
                     });", url), con="scrape.js")
  
  # process the script with phantomjs / scrapes the Ameriflux page
  system(sprintf("%s/./phantomjs scrape.js > scrape.html",path),wait=TRUE)
  
  # load html data
  main = read_html("scrape.html")
  
  # set html element selector for the header
  sel_header = 'thead'
  
  # Extract the header data from the html file
  header = html_nodes(main,sel_header) %>% html_text()
  header = unlist(strsplit(header,"\\n"))
  header = unlist(lapply(header,trim.leading))
  header = header[-which(header == "")]
  
  # set html element selector for the table
  sel_data = 'td'
  
  # process the html file and extract stats
  data = html_nodes(main,sel_data) %>% html_text()
  data = matrix(data,length(data)/length(header),length(header),byrow=TRUE)
  df = data.frame(data)
  colnames(df) = header
  
  # reformat variables into correct formats (not strings)
  # this is ugly, needs cleaning up
  df$SITE_ID = as.character(df$SITE_ID)
  df$SITE_NAME = as.character(df$SITE_NAME)
  df$TOWER_BEGAN = as.numeric.factor(df$TOWER_BEGAN)
  df$TOWER_END = as.numeric.factor(df$TOWER_END)
  df$LOCATION_LAT = as.numeric.factor(df$LOCATION_LAT)
  df$LOCATION_LONG = as.numeric.factor(df$LOCATION_LONG)
  df$LOCATION_ELEV = as.numeric.factor(df$LOCATION_ELEV)
  df$MAT = as.numeric.factor(df$MAT)
  df$MAP = as.numeric.factor(df$MAP)
  
  # drop double entries
  df = unique(df)
  
  # drop first row (empty)
  df = df[-1,]
  
  # remove temporary html file and javascript
  file.remove("scrape.html")
  file.remove("scrape.js")
  
  # return data frame
  return(df)
}
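
As mentioned above, a minimal usage sketch, again with a placeholder PhantomJS path. The column names follow the LBL site table; the cut-offs (45° N, towers running before 2005) are arbitrary examples of the kind of sub-setting I have in mind.

# scrape the site table (placeholder phantomjs path, adjust to your system)
df <- ameriflux.info(path = "~/bin")

# example subset: sites above 45 degrees north with towers running
# since before 2005
north_sites <- df[which(df$LOCATION_LAT > 45 & df$TOWER_BEGAN < 2005),]
north_sites$SITE_ID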

 
