Scraping Ameriflux site info

On the flight home from AGU 2015 I realized that the same code that I used to scrape Zooniverse statistics could easily be changed to grab the site summary data from the Ameriflux LBL page. As with the Zooniverse code, it relies on external PhantomJS binaries.

The function returns a data frame with all scraped data (site names, lat/long, altitude etc…). Errors in the table are due to errors in the original data, not the conversion (mainly start and end dates).

I’ll use this function in combination my Ameriflux download tool to provide easier sub-setting of the data. Keep an eye on my blog for upcoming updates to my Ameriflux download tool.

#' Grabs the ameriflux site table from the LBL site
#' @param url: Location of the Ameriflux site table
#' @param path: location of the phantomjs binary (system specific)
#' @keywords Ameriflux, sites, locations, web scraping
#' @export
#' @examples
#' # with defaults, outputting a data frame
#' df <-
#' [requires the rvest package for post-processing]
#' <- function(url="",
  # read the required libraries
  # subroutines for triming leading spaces
  # and converting factors to numeric
  trim.leading <- function (x)  sub("^\\s+", "", x)
  as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
  # write out a script phantomjs can process
  # change timeout if the page bounces, seems empty !!!
  writeLines(sprintf("var page = require('webpage').create();
           '%s', function (status) {
                     if (status !== 'success') {
                     console.log('Unable to load the address!');
                     } else {
                     window.setTimeout(function () {
                     }, 3000); // Change timeout to render the page
                     });", url), con="scrape.js")
  # process the script with phantomjs / scrapes zooniverse page
  system(sprintf("%s/./phantomjs scrape.js > scrape.html",path),wait=TRUE)
  # load html data
  main = read_html("scrape.html")
  # set html element selector for the header
  sel_header = 'thead'
  # Extract the header data from the html file
  header = html_nodes(main,sel_header) %>% html_text()
  header = unlist(strsplit(header,"\\n"))
  header = unlist(lapply(header,trim.leading))
  header = header[-which(header == "")]
  # set html element selector for the table
  sel_data = 'td'
  # process the html file and extract stats
  data = html_nodes(main,sel_data) %>% html_text()
  data = matrix(data,length(data)/length(header),length(header),byrow=TRUE)
  df = data.frame(data)
  colnames(df) = header
  # reformat variables into correct formats (not strings)
  # this is ugly, needs cleaning up
  df$SITE_ID = as.character(df$SITE_ID)
  df$SITE_NAME = as.character(df$SITE_NAME)
  df$TOWER_BEGAN = as.numeric.factor(df$TOWER_BEGAN)
  df$TOWER_END = as.numeric.factor(df$TOWER_END)
  df$LOCATION_LAT = as.numeric.factor(df$LOCATION_LAT)
  df$LOCATION_LONG = as.numeric.factor(df$LOCATION_LONG)
  df$LOCATION_ELEV = as.numeric.factor(df$LOCATION_ELEV)
  df$MAT = as.numeric.factor(df$MAT)
  df$MAP = as.numeric.factor(df$MAP)
  # drop double entries
  df = unique(df)
  # drop first row (empty)
  df = df[-1,]
  # remove temporary html file and javascript
  # return data frame


© 2018. All rights reserved.

Powered by Hydejack v7.5.1