Processing Jungle Rhythms data: intermediate results

After a few days of struggling with R code I have the first results of processed data at my fingertips! Below you see a 3 window plot which shows the original image on top, the annotated image in the middle and the final extracted data at the bottom. The middle window gives you an idea of the geometry involved in the calculation of the final data at the bottom. I’ll shortly detail my approach, as I abandoned the idea I proposed in a previous blog post.

Using the most common six coordinates for each yearly sections (red dots overlaying the individual markings as green crosses - middle window) I calculate the approximate location of the rows as marked by the red dots on the vertical bold lines (line - circle intersection). All annotations are then projected on to this ideal row (projection of a point onto a line), rendering the coloured lines for different observation types. Finally, with the overall distance of each given row (on a half year basis - point to point distance) I calculate the location in time which is occupied by an annotation. Classifying the lines into life cycle event types is done by minimizing the distance to the ideal row.

All images are shown to 10 independent citizen scientists, and for each individual annotation these values are summed. Where there is a high degree of agreement among citizen scientists I will see a larger total sum. If there is an unanimous agreement among them the sum would be 10 for a given day of year (DOY). As not all subjects are retired, the value as displayed below only has a maximum count of 8. The spread around the edges of the annotations is due to variability in the classifications.

You already notice some pattens in the life cycle events. More on these patterns in a later post, when I can match the data with species names.


Processing Jungle Rhythms data: coordinate transformations and line intersections

I mentioned that I started working on processing some annotations to data. This will get me a feeling for the data quality, but more so get me thinking about how to process the data efficiently.

In this blog post I’ll quickly outline the methodology I’ll use to process the annotations. First I have to give a quick summary on what the data looks like once annotated, how to deconstruct this data into usable data.

Below you see a picture of an annotated yearly section. Red dots outline the most common intersection coordinates for the yearly section (green crosses represent all measurements), while green lines represent the annotated life cycle events within the yearly section. Note the accuracy of all the annotations, rather amazing work by everyone who contributed!

With these key locations (red dots) of the yearly section, mainly: the start, middle and end (providing the general orientation of the yearly section within the image), the annotations can translated into true data (compensating for skewness and warping in the picture).

Each yearly section, will be processed a half year at a time using roughly four steps:

  1. For each year I make sure the bottom axis (bottom left - bottom middle or bottom middle - bottom right) is aligned along the Cartesian x-axis. This is done by rotating the data around the bottom left or bottom middle point, respectively. -> in this image the axis is relatively close to optimal!
  2. Since I only process data within half a yearly section I trim annotated lines to fit each six month period.
  3. After trimming the annotations to fit neatly within the first six months I need to transform all these coordinates to days within a year. I know the spacing between the row is equal. As such, the total length of a row can be calculated as the distance of a line which crosses the beginning and end of half a yearly section.
  4. What remains is to calculate the distance between the beginning and start of an annotated segment relative to the total length to determine the days they cover during the year.

Finally, all these data will be combined into a matrix. This matrix will then be linked to the original species data, kept in a separate file.

Scraping Tropicos data in R

Just a little web scraper to mine species distribution data from the Tropicos botanical database.

The script scrapes the first returned result, or if there is only one result it will display the available data (a data frame of all available data). If not distribution data was found the function will return an NA value.

#' @param species: genus species or genus
#' @param quiet: TRUE / FALSE provides verbose output
#' @keywords Tropicos, species distribution
#' @export
#' @examples
#' # with defaults, outputting a data frame with species distribution
#' # for Clematis
#' df <- tropicos.species.distribution()
#' # returns NA if no data are present
#' [requires the rvest package for post-processing]

tropicos.species.distribution <- function(species='Clematis',quiet=TRUE){
  # read the required libraries
  species = gsub(" ", "+", species)
  base_url = ""
  url = sprintf("%s%s",base_url,species)
  # load the page into memory
  html_page = read_html(url)
  # first check if the page lists a table of species
  # or was redirected to the only search result presented
  check = html_page %>%
    html_nodes("#ctl00_footerControl_citationDiv") %>%
  sel = regexpr("(http://).*[0-9]",check)
  final_url = regmatches(check,sel)
  # if the length of the final_url is 1 then use the final_url
  # to move on, otherwise select the first listed species
  # and move to this page to list the distribution
    url = sprintf("%s%s",final_url,"?tab=distribution")
    html_page = read_html(url)
    distribution = html_page %>% 
      html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
      html_table() %>%
    # if multiple search results are present, pick the top of the list
    # and follow this link to the distribution data if available
    # load the first species in the table generated by the species search
    first_link_listed = html_page %>% 
      html_nodes("#ctl00_MainContentPlaceHolder_nameSearchControl_gridView tr:nth-child(2) td:nth-child(3)") %>%
      html_nodes("a") %>%
    # clean up the string, grab the species indentifier number
    first_link_listed = gsub("[/Name/]", "", first_link_listed)
    # format the new string, linking to the species distribution
    url = sprintf("",first_link_listed,"?tab=distribution")
    html_page = read_html(url)
    # extract the species distribution in full from the page
    distribution = html_page %>% 
      html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
      html_table() %>%
    distribution = NA
  # return the data frame, verbose or not
  if (quiet == FALSE){


Crunching Jungle Rhythms data

I haven’t blogged about Jungle Rhythms in a while. So here is a quick update on things! I’m currently working through the first batch of Jungle Rhythms data.

Although the fully annotated data is not in, I’ve partial data to work on and get an algorithm running. This algorithm would extract the phenological data as annotated by everyone who contributed and turn them into true dates (or weeks of a particular year).

Sadly, the data structure as used by Zooniverse is currently less than ideal. Zooniverse data exports use a comma separated file format (CSV) with JSON content. However, R, in which I do most of my statistics and processing, is rather bad in dealing with JSON data. Even using Python, which handles JSON better, the data structure remains rather cumbersome. I submitted a GitHub ticket (i.e. a way to request features in software or report bugs) raising this concern, and it will be addressed in the near future.

In the mean time, I’ll still code up a processing routine to assess the intermediate results. Hopefully the new data format will make all this a bit more straightforward and transparent.



Adventures in GPU computing and Deep Learning land

The past weeks I’ve been toying with GPU computing and Deep Learning. This is a brief summary of things I’ve learned in setting things up (hardware and software).

Let’s talk about hardware! I’m currently running everything on a NVIDIA MSI GTX960 4GB Gaming card. Given the cost of the card (~$250) and the fact that for a lot of the Deep Learning applications memory will be the limiting factor (Google: “Check failed: error == cudaSuccess (2 vs. 0) out of memory”) rather than CUDA cores, this is a really good card. Most faster cards (more CUDA cores) will have the same amount of memory. In short, you might gain some time but it does not allow you to train more complex models (larger in size). Unless you upgrade to a NVIDIA GTX Titan X card with 12GB of memory (costing 5x more) or dedicated compute units such as a Tesla K40 you won’t see a substantial memory increase in the product line.

On the software end, the documentation on the installation of the Caffe framework is sufficient to get you started (on Ubuntu at least). The documentation is good and the community seems to be responsive, judging from the forum posts, github issues. However, I’ve not engaged in any real trouble shooting requests (as things went well). The only issue I encountered when setting up the Caffe-Segnet implementation was the above mentioned “out of memory error”. Although people mentioned that they got things running on 4GB cards I just couldn’t get it to work. In the end I realized that the memory being used to drive my displays (~500MB) might just tip the scale.

Indeed, offloading graphics duties from the dedicated card onto the motherboard’s integrated graphics did free up enough memory to make things work. An important note here is that you need to install the CUDA Toolbox and drivers without their OpenGL drivers. The CUDA Toolbox overwrites the Ubuntu originals and breaks the graphics capabilities of your integrated graphics in the process.

To install the CUDA Toolbox correctly, with full memory capabilities first select the integrated graphics (iGPU) as your preferred GPU in your system BIOS. Then download the CUDA Toolbox runtime and run the following command:

sudo ./ --no-opengl-libs

This should install all CUDA libraries and drivers while at the same time prevent the Toolbox from overwriting your iGPU OpenGL libraries. In addition you might also want to make sure you initiate the NVIDIA devices on boot (devices should be accessible but not driving any display) by placing the below script in your /etc/init.d/ folder.


NVDEVS=`lspci | grep -i NVIDIA`
N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

N=`expr $N3D + $NVGA - 1`
for i in `seq 0 $N`; do
   mknod -m 666 /dev/nvidia$i c 195 $i

mknod -m 666 /dev/nvidiactl c 195 255

(I’m not quite sure the latter script has any use as the devices might be initiated already anyway. However, I rather list it here for future reference.)

In the end, both the MIT scene recognition model as well as the Caffe-Segnet implementation work. Below you see the output of the Segnet Camvid demo, which uses the input of a webcam and classifies it on the fly into the 12 classes. The output is garbage as the scene does not represent a street view (the original intent of the classifier), but it shows that the system runs at a solid 210ms per classification.




© 2018. All rights reserved.

Powered by Hydejack v7.5.1