# Scraping Zooniverse statistics

In order to keep track of the Jungle Rhythms project I wanted some basic summary statistics, as shown on the front page of the project. However, the front end API of the project does not allow these basic statistics to be pulled from a database. Furthermore, fetching all the project data can only be done once a day (to prevent heavy traffic on the database), keeping me from generating these statistics myself. Still, I want to keep track of how classifications and users change across time.

So, I wrote a web scraper in R which I run every half hour. Because the page is dynamic, it first renders the page using PhantomJS. It then passes the resulting HTML file through the rvest R package to extract all necessary (time-stamped) elements and writes everything to file, appending to the output file if it already exists. You can find the code (an R function) below.

#' Grab basic zooniverse statistics from the front page of a project
#' @param url location of the zooniverse project
#' @param file the name of the output file to export statistics to
#' @param path location of the phantomjs binary (system specific)
#' @keywords zooniverse, statistics, web scraping
#' @export
#' @examples
#' # with defaults, writes a file called user.stats.csv
#' # for the Jungle Rhythms project
#' # [requires the rvest package for post-processing]
#' zooniverse.info()
#'

zooniverse.info <- function(url="http://www.zooniverse.org/projects/khufkens/jungle-rhythms/home",
file="user.stats.csv",
path="~/your.phantom.js.location/"){

require(rvest)

# grab current date and time (a time stamp)
date = format(Sys.Date(),"%Y-%m-%d")
time = format(Sys.time(),"%H:%M")

# write out a script phantomjs can process
# change timeout if the page bounces, seems empty !!!
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function (status) {
if (status !== 'success') {
phantom.exit();
} else {
window.setTimeout(function () {
console.log(page.content);
phantom.exit();
}, 3000); // Change timeout to render page
}
});", url), con="scrape.js")

# process the script with phantomjs / scrapes zooniverse page
system(sprintf("%s/./phantomjs scrape.js > scrape.html",path),wait=TRUE)

# load the retrieved rendered javascript page
main = read_html("scrape.html")

# set html element selector (which html fields to retrieve)

# process the html file using selection and render as text
data = html_nodes(main,sel) %>% html_text()

# if data is retrieved, append to user.stats.csv file
# if this fails, you most likely need more time to render
# the page (see timeout above)
if (!identical(data, character(0))){

# kick out description fields and convert to numeric
data = as.numeric(data[-c(2,4,6,8)])

# merge into dataframe
data = data.frame(date, time, t(data))
colnames(data) = c('date','time','registered_users',
'classifications','subjects','retired_subjects')

# append stats with the current date and time
# to an already existing data file
if (file.exists(file)){
write.table(data,file,quote=FALSE,row.names=FALSE,col.names=FALSE,append=TRUE)
}else{
write.table(data,file,quote=FALSE,row.names=FALSE,col.names=TRUE)
}
}

# remove html file and javascript
file.remove("scrape.html")
file.remove("scrape.js")
}
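
The post-processing step can be tried without PhantomJS. The following is a minimal sketch, not the code that runs against the live page: the HTML snippet, its `stat` class and the `append_stats()` helper are assumptions made up for illustration, and the numbers are invented. It walks through the select, clean and append-or-create steps on a small inline page.

```r
library(rvest)

# Hypothetical stand-in for the rendered page: value / label pairs.
# The class name "stat" is an assumption, not the real Zooniverse markup.
page <- read_html('
  <div>
    <span class="stat">218</span><span class="stat">Registered users</span>
    <span class="stat">8012</span><span class="stat">Classifications</span>
    <span class="stat">1500</span><span class="stat">Subjects</span>
    <span class="stat">1</span><span class="stat">Retired subjects</span>
  </div>')

# select all matching nodes and render them as text
raw <- html_text(html_nodes(page, ".stat"))

# drop the label fields (even positions) and convert to numeric
values <- as.numeric(raw[-c(2, 4, 6, 8)])

# build one time-stamped row
row <- data.frame(date = format(Sys.Date(), "%Y-%m-%d"),
                  time = format(Sys.time(), "%H:%M"),
                  t(values))
colnames(row) <- c("date", "time", "registered_users",
                   "classifications", "subjects", "retired_subjects")

# hypothetical helper: write a header only when the file is new,
# otherwise append the row to the existing file
append_stats <- function(row, file) {
  file_present <- file.exists(file)
  write.table(row, file, quote = FALSE, row.names = FALSE,
              col.names = !file_present, append = file_present)
}

# two runs of the scraper give a header plus two rows
out <- tempfile(fileext = ".csv")
append_stats(row, out)
append_stats(row, out)
stats <- read.table(out, header = TRUE)
```

On the real page the selector has to match whichever elements hold the statistics, which is easiest to find by inspecting the rendered `scrape.html` file.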

# Scraping Ameriflux site info

On the flight home from AGU 2015 I realized that the same code that I used to scrape Zooniverse statistics could easily be changed to grab the site summary data from the Ameriflux LBL page. As with the Zooniverse code, it relies on external PhantomJS binaries.

The function returns a data frame with all scraped data (site names, lat/long, altitude etc…). Errors in the table are due to errors in the original data, not the conversion (mainly start and end dates).

I’ll use this function in combination with my Ameriflux download tool to provide easier sub-setting of the data. Keep an eye on my blog for upcoming updates to my Ameriflux download tool.

#' Grabs the ameriflux site table from the LBL site
#' @param url location of the Ameriflux site table
#' @param path location of the phantomjs binary (system specific)
#' @keywords Ameriflux, sites, locations, web scraping
#' @export
#' @examples
#' # with defaults, outputting a data frame
#' # [requires the rvest package for post-processing]
#' df <- ameriflux.info()

ameriflux.info <- function(url="http://ameriflux.lbl.gov/sites/site-list-and-pages/",
path="~/my.phantom.js.path/"){

require(rvest)

# subroutines for trimming leading spaces
# and converting factors (or strings) to numeric
trim.leading <- function (x)  sub("^\\s+", "", x)
as.numeric.factor <- function(x) {as.numeric(as.character(x))}

# write out a script phantomjs can process
# change timeout if the page bounces, seems empty !!!
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function (status) {
if (status !== 'success') {
phantom.exit();
} else {
window.setTimeout(function () {
console.log(page.content);
phantom.exit();
}, 3000); // Change timeout to render the page
}
});", url), con="scrape.js")

# process the script with phantomjs / scrapes the ameriflux page
system(sprintf("%s/./phantomjs scrape.js > scrape.html",path),wait=TRUE)

# load the retrieved rendered javascript page
main = read_html("scrape.html")

# set html element selector for the header

# Extract the header data from the html file

# set html element selector for the table
sel_data = 'td'

# process the html file and extract stats
data = html_nodes(main,sel_data) %>% html_text()
df = data.frame(data)

# reformat variables into correct formats (not strings)
# this is ugly, needs cleaning up
df$SITE_ID = as.character(df$SITE_ID)
df$SITE_NAME = as.character(df$SITE_NAME)
df$TOWER_BEGAN = as.numeric.factor(df$TOWER_BEGAN)
df$TOWER_END = as.numeric.factor(df$TOWER_END)
df$LOCATION_LAT = as.numeric.factor(df$LOCATION_LAT)
df$LOCATION_LONG = as.numeric.factor(df$LOCATION_LONG)
df$LOCATION_ELEV = as.numeric.factor(df$LOCATION_ELEV)
df$MAT = as.numeric.factor(df$MAT)
df$MAP = as.numeric.factor(df$MAP)

# drop double entries
df = unique(df)

# drop first row (empty)
df = df[-1,]

# remove temporary html file and javascript
file.remove("scrape.html")
file.remove("scrape.js")

# return data frame
return(df)
}
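
The step from a flat vector of table cells to a typed data frame can be sketched in isolation. This is a minimal illustration under stated assumptions, not the exact code used on the Ameriflux page: the `header` and `cells` vectors stand in for the text extracted from the page, and the site IDs, names and coordinates are invented.

```r
# Hypothetical stand-ins for the text of the header and table cells;
# site IDs, names and coordinates are invented
header <- c("SITE_ID", "SITE_NAME", "LOCATION_LAT", "LOCATION_LONG")
cells <- c(" XX-001", "Example Forest", "45.10", "-72.20",
           " XX-002", "Example Grassland", "38.50", "-120.90",
           " XX-001", "Example Forest", "45.10", "-72.20")

# strip leading whitespace left over from the page layout
cells <- sub("^\\s+", "", cells)

# the cells arrive row by row, so fill the matrix by row
mat <- matrix(cells, ncol = length(header), byrow = TRUE)
df <- data.frame(mat, stringsAsFactors = FALSE)
colnames(df) <- header

# convert the numeric columns; going through as.character() works
# for both character and factor columns
num_cols <- c("LOCATION_LAT", "LOCATION_LONG")
df[num_cols] <- lapply(df[num_cols], function(x) as.numeric(as.character(x)))

# drop double entries
df <- unique(df)

# why the detour matters for factors: as.numeric() on a factor
# returns the internal level codes, not the printed values
f <- factor(c("10.5", "2.25"))
codes <- as.numeric(f)
values <- as.numeric(as.character(f))
```

Whether the columns come out as factors or as strings depends on the R version and the `stringsAsFactors` setting, which is why the conversion goes through `as.character()` in both cases.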

# One week of classifications

The Jungle Rhythms project has been running for one week and classifications are coming in steadily. Currently, over 8,000 images have been classified by only a limited number of users (218). Unless an army of unregistered users is pushing the effort, a lot of credit goes to a relatively small but dedicated set of citizen scientists, which is rather remarkable. In the figure below you see a consistent, steady stream of classifications (almost linear over time, as of 18/12/2015). On the x-axis you see the date (+ time), on the y-axis you see the total classification count (top panel) and the number of classifications per day (bottom panel). The red vertical bars denote the new registered users who contributed to the project.
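
The per-day numbers in such a figure can be derived from the cumulative counts in the scraped log. The sketch below assumes a log with the column layout written by the scraper; the `stats` values are invented for illustration and the approach (daily maximum followed by a difference) is one possible way to do this, not necessarily how the figure was made.

```r
# Hypothetical half-hourly log (thinned out here); numbers are made up
stats <- data.frame(
  date = c("2015-12-16", "2015-12-16", "2015-12-17",
           "2015-12-17", "2015-12-18"),
  time = c("09:00", "21:00", "09:00", "21:00", "09:00"),
  registered_users = c(200, 205, 209, 214, 218),
  classifications = c(5000, 5900, 6600, 7500, 8000)
)

# counts are cumulative, so the daily maximum is the total
# reached by the end of each day
daily_total <- sapply(split(stats$classifications, stats$date), max)
daily_users <- sapply(split(stats$registered_users, stats$date), max)

# day-to-day increase: classifications per day and new registered users
per_day <- diff(daily_total)
new_users <- diff(daily_users)
```

The first day drops out of `per_day` and `new_users` because there is no earlier day to compare it with.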

This week is also the week of the AGU conference, the yearly meeting of geoscientists in San Francisco. I presented the Jungle Rhythms project there and hope this will draw some attention to the project and potentially attract some more contributors.

# Jungle Rhythms Statistics

First of all, hooray… the first subject in my project retired!!!

Furthermore, in order to track the progress of the project without downloading all classifications I wrote a little web scraper in R which grabs the summary statistics such as:

• # registered users
• # classifications
• # subjects (images to classify)
• # retired subjects (these are finished)

I grab these statistics on a half-hourly basis to track progress of the project over time. Every day I make a summary graph of these results. This summary graph can be found on the Jungle Rhythms landing page and below. In this graph I currently only plot the number of classified subjects over time, as well as the number of registered users.

This simple graph should give me an idea of the rate of progress of the project. Check in on a regular basis to see how things progress.
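
A daily summary graph of this kind could be produced along the following lines. This is a rough sketch with base R graphics, assuming the column layout written by the scraper; the `plot_progress()` helper, the sample `stats` values and the choice of a PDF device are all assumptions for illustration, and the classification count is plotted as a stand-in for project progress.

```r
# Hypothetical sample of the scraped log; numbers are made up
stats <- data.frame(
  date = c("2015-12-16", "2015-12-17", "2015-12-18"),
  time = c("09:00", "09:00", "09:00"),
  registered_users = c(200, 209, 218),
  classifications = c(5000, 6600, 8000)
)

# hypothetical helper: two stacked time series written to a PDF file
plot_progress <- function(stats, out) {
  # combine the date and time columns into a single time stamp
  stamp <- as.POSIXct(paste(stats$date, stats$time),
                      format = "%Y-%m-%d %H:%M", tz = "UTC")

  pdf(out, width = 7, height = 7)
  on.exit(dev.off())  # close the device even if plotting fails

  par(mfrow = c(2, 1), mar = c(4, 4, 1, 1))
  plot(stamp, stats$classifications, type = "l",
       xlab = "", ylab = "classifications")
  plot(stamp, stats$registered_users, type = "l",
       xlab = "date", ylab = "registered users")

  invisible(stamp)
}

out_file <- tempfile(fileext = ".pdf")
stamp <- plot_progress(stats, out_file)
```

In practice `stats` would be read from the scraper's output file with `read.table(file, header = TRUE)`, and the function could be run once a day after the latest scrape.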

# From a pale blue dot to distant frozen world

A little over 25 years ago, on February 14th 1990, the Voyager 1 space probe turned around and snapped one last picture of planet Earth and the solar system it was about to leave behind. Seen from more than 6 billion kilometers from the sun, planet Earth was only a pale blue dot, barely a pixel large.

This is the “Pale Blue Dot” photograph of the Earth taken by the Voyager 1 spacecraft on February 14, 1990. The Earth is the relatively bright speck of light about halfway across the uppermost sunbeam.

Look again at that dot. That's here. That's home. That's us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every "superstar," every "supreme leader," every saint and sinner in the history of our species lived there - on a mote of dust suspended in a sunbeam. - Carl Sagan, Pale Blue Dot

Yet, today, at roughly the same distance from the sun, the New Horizons mission delivered the sharpest photo of Pluto at the far edge of the solar system. The picture shows details smaller than a city block. Again, this is a testament to human ingenuity and the drive to explore, but it once more emphasizes that we are only “a mote of dust suspended in a sunbeam”. Humbling and rather amazing, if you think about it.