Scraping Ameriflux site info

On the flight home from AGU 2015 I realized that the same code that I used to scrape Zooniverse statistics could easily be changed to grab the site summary data from the Ameriflux LBL page. As with the Zooniverse code, it relies on external PhantomJS binaries.

The function returns a data frame with all scraped data (site names, lat/long, altitude etc…). Errors in the table (mainly in the start and end dates) stem from the original data, not from the conversion.

I’ll use this function in combination with my Ameriflux download tool to make sub-setting the data easier. Keep an eye on my blog for upcoming updates to my Ameriflux download tool.

#' Grabs the Ameriflux site table from the LBL site
#'
#' Requires the rvest package for post-processing and the PhantomJS
#' binaries (http://phantomjs.org/download.html).
#' @param url location of the Ameriflux site table
#' @param path location of the phantomjs binary (system specific)
#' @keywords Ameriflux, sites, locations, web scraping
#' @export
#' @examples
#' # with defaults, outputting a data frame
#' df <- ameriflux.info()

ameriflux.info <- function(url="http://ameriflux.lbl.gov/sites/site-list-and-pages/",
                           path="~/my.phantom.js.path/"){
  
  # read the required libraries
  require(rvest)
  
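  # subroutines for trimming leading spaces
  # and converting factors (or strings) to numeric
  trim.leading <- function (x)  sub("^\\s+", "", x)
  # going through as.character() works for factor and character columns,
  # so this also holds when data.frame() no longer creates factors (R >= 4.0)
  as.numeric.factor <- function(x) {as.numeric(as.character(x))}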
  
  # write out a script phantomjs can process
  # change timeout if the page bounces, seems empty !!!
  writeLines(sprintf("var page = require('webpage').create();
                     page.open('%s', function (status) {
                     if (status !== 'success') {
                     console.log('Unable to load the address!');
                     phantom.exit();
                     } else {
                     window.setTimeout(function () {
                     console.log(page.content);
                     phantom.exit();
                     }, 3000); // Change timeout to render the page
                     }
                     });", url), con="scrape.js")
  
  # process the script with phantomjs / scrapes the Ameriflux page
  system(sprintf("%s/./phantomjs scrape.js > scrape.html",path),wait=TRUE)
  
  # load html data
  main = read_html("scrape.html")
  
  # set html element selector for the header
  sel_header = 'thead'
  
  # Extract the header data from the html file
  header = html_nodes(main,sel_header) %>% html_text()
  header = unlist(strsplit(header,"\\n"))
  header = unlist(lapply(header,trim.leading))
  # logical indexing stays correct even when there are no empty strings
  header = header[header != ""]
  
  # set html element selector for the table
  sel_data = 'td'
  
  # process the html file and extract stats
  data = html_nodes(main,sel_data) %>% html_text()
  data = matrix(data,length(data)/length(header),length(header),byrow=TRUE)
  df = data.frame(data)
  colnames(df) = header
  
  # reformat variables into correct formats (not strings)
  # this is ugly, needs cleaning up
  df$SITE_ID = as.character(df$SITE_ID)
  df$SITE_NAME = as.character(df$SITE_NAME)
  df$TOWER_BEGAN = as.numeric.factor(df$TOWER_BEGAN)
  df$TOWER_END = as.numeric.factor(df$TOWER_END)
  df$LOCATION_LAT = as.numeric.factor(df$LOCATION_LAT)
  df$LOCATION_LONG = as.numeric.factor(df$LOCATION_LONG)
  df$LOCATION_ELEV = as.numeric.factor(df$LOCATION_ELEV)
  df$MAT = as.numeric.factor(df$MAT)
  df$MAP = as.numeric.factor(df$MAP)
  
  # drop double entries
  df = unique(df)
  
  # drop first row (empty)
  df = df[-1,]
  
  # remove temporary html file and javascript
  file.remove("scrape.html")
  file.remove("scrape.js")
  
  # return data frame
  return(df)
}
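
The sub-setting mentioned above could look something like the sketch below. It does not call ameriflux.info() (which needs PhantomJS); it uses a toy data frame with the same column names instead. All values are made-up placeholders and `subset_sites` is a hypothetical helper, not part of my download tool.

```r
# Toy stand-in for the output of ameriflux.info(); the column names follow
# the function above, but every value is a made-up placeholder.
sites <- data.frame(
  SITE_ID = c("XX-Aa1", "XX-Bb2", "XX-Cc3"),
  SITE_NAME = c("Placeholder site A", "Placeholder site B", "Placeholder site C"),
  LOCATION_LAT = c(45.2, 31.7, 68.5),
  LOCATION_LONG = c(-72.1, -110.0, -149.3),
  TOWER_BEGAN = c(1998, 2004, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep sites inside a latitude band that also have a
# known start year. The is.na() checks guard against gaps in the table.
subset_sites <- function(df, lat_min, lat_max) {
  keep <- !is.na(df$LOCATION_LAT) &
    df$LOCATION_LAT >= lat_min &
    df$LOCATION_LAT <= lat_max &
    !is.na(df$TOWER_BEGAN)
  df[keep, ]
}

mid_lat <- subset_sites(sites, lat_min = 30, lat_max = 50)
print(mid_lat$SITE_ID)
```

Filtering on missing values explicitly matters here, since the start and end dates in the original table are the columns most likely to contain gaps.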

 

One week of classifications

The Jungle Rhythms project has been running for one week and classifications are coming in steadily. So far, over 8,000 images have been classified by a limited number of users (218). Unless an army of unregistered users is pushing the effort, a lot of credit goes to a relatively small but dedicated set of citizen scientists, which is rather remarkable. The figure below shows a consistent, steady stream of classifications (almost linear over time, as of 18/12/2015). The x-axis shows the date (and time); the y-axis shows the total classification count (top panel) and the number of classifications per day (bottom panel). The red vertical bars denote newly registered users who contributed to the project.

This week is also the week of the AGU conference, the yearly meeting of geoscientists in San Francisco. I presented the Jungle Rhythms project there and hope this will draw some attention to the project and potentially attract some more contributors.
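
The bottom panel (classifications per day) can be derived from the running total in the top panel. A minimal sketch in R, assuming one reading per day; the dates and counts are made-up placeholders, not the real project numbers:

```r
# Made-up cumulative totals, one reading per day (placeholder values only).
totals <- data.frame(
  date = as.Date(c("2015-12-11", "2015-12-12", "2015-12-13", "2015-12-14")),
  classifications = c(0, 1200, 2500, 3600)
)

# Classifications per day are the day-to-day differences in the running total.
per_day <- data.frame(
  date = totals$date[-1],
  classifications = diff(totals$classifications)
)
print(per_day)
```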

 

Jungle Rhythms Statistics

First of all, hooray… the first subject in my project retired!!!

Furthermore, in order to track the progress of the project without downloading all classifications I wrote a little web scraper in R which grabs the summary statistics such as:

  • # registered users
  • # classifications
  • # subjects (images to classify)
  • retired subjects (these are finished)

I grab these statistics on a half-hourly basis to track the progress of the project over time. Every day I make a summary graph of these results. This summary graph can be found on the Jungle Rhythms landing page and below. In this graph I currently only plot the number of classified subjects over time, as well as the number of registered users.

This simple graph should give me an idea of the rate of progress of the project. Check in on a regular basis to see how things progress.
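
Keeping such a half-hourly record can be as simple as appending one timestamped row per scrape to a CSV file. The sketch below shows the idea only, not the scraper itself: `log_stats`, the column names and the numbers are all hypothetical.

```r
# Append one timestamped row of summary statistics to a CSV log.
# `log_stats` and its column names are hypothetical, for illustration only.
log_stats <- function(file, users, classifications, retired,
                      time = Sys.time()) {
  row <- data.frame(
    time = format(time, "%Y-%m-%d %H:%M:%S"),
    users = users,
    classifications = classifications,
    retired = retired
  )
  new_file <- !file.exists(file)
  # write the header only once, when the log is first created
  write.table(row, file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
  invisible(row)
}

# Two made-up readings written to a temporary file, then read back.
log_file <- tempfile(fileext = ".csv")
log_stats(log_file, users = 10, classifications = 100, retired = 0)
log_stats(log_file, users = 12, classifications = 150, retired = 1)
stats_log <- read.csv(log_file)
print(stats_log)
```

Run from a scheduler (a cron job, for example), a log like this is enough to redraw the summary graph every day without downloading all classifications.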

From a pale blue dot to distant frozen world

A little over 25 years ago, on February 14th 1990, the Voyager 1 space probe turned around and snapped one last picture of planet Earth and the solar system it was about to leave behind. More than 6 billion kilometers from the sun, planet Earth was only a pale blue dot, barely a pixel large.

This is the “Pale Blue Dot” photograph of the Earth taken by the Voyager 1 spacecraft on February 14, 1990. The Earth is the relatively bright speck of light about halfway across the uppermost sunbeam.

 

Look again at that dot. That's here. That's home. That's us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every "superstar," every "supreme leader," every saint and sinner in the history of our species lived there-on a mote of dust suspended in a sunbeam. - Carl Sagan, Pale Blue Dot

Yet today, at roughly the same distance from the sun, the New Horizons mission delivered the sharpest photo of Pluto at the far edge of the solar system. The picture shows details smaller than a city block. Again this is a testament to human ingenuity and the drive to explore, but it once more emphasizes that we are only “a mote of dust suspended in a sunbeam”. Humbling and rather amazing, if you think about it.

 

 

Jungle Rhythms Launch

Today I launch Jungle Rhythms, an online citizen-science project that aims to digitize thousands of pages of detailed observations of the life cycle of trees in Africa.

Belgian scientists were stationed at the Yangambi Research Station in what is now the Democratic Republic of the Congo from 1938 until 1958 as part of an agriculture-based research project. During that time, the scientists, for reasons unknown, also began collecting detailed observations on the life cycle of trees in the local forest. Those observations were kept in a series of notebooks, and later summarized in large tables, which were discovered, nearly 80 years later, stored in an archive under less-than-ideal conditions.

To avoid losing the data as the pages crumbled, I digitized the tables in the hopes of using computers to automatically capture the data, but quickly realized the marks were simply too faint.

While the project’s ultimate goal is to preserve the data for future study, it also gives the public an up-close-and-personal view of how scientific research is conducted. To that end, I’ll be blogging about the project to keep users up to date on exciting new results, and on any discoveries I make about the history of the data itself.



© 2018. All rights reserved.
