# Processing Jungle Rhythms data: coordinate transformations and line intersections

I mentioned earlier that I had started processing some of the annotations into data. This gives me a feel for the data quality but, more importantly, gets me thinking about how to process the data efficiently.

In this blog post I’ll quickly outline the methodology I’ll use to process the annotations. First, I’ll give a quick summary of what the data looks like once annotated and how to deconstruct it into usable data.

Below you see a picture of an annotated yearly section. Red dots outline the most common intersection coordinates for the yearly section (green crosses represent all measurements), while green lines represent the annotated life cycle events within the yearly section. Note the accuracy of all the annotations, rather amazing work by everyone who contributed!

With these key locations (red dots) of the yearly section, mainly the start, middle and end (which provide the general orientation of the yearly section within the image), the annotations can be translated into true data, compensating for skewness and warping in the picture.

Each yearly section will be processed half a year at a time, in roughly four steps:

1. For each yearly section I make sure the bottom axis (bottom left to bottom middle, or bottom middle to bottom right) is aligned with the Cartesian x-axis. This is done by rotating the data around the bottom left or bottom middle point, respectively. (In this image the axis is already relatively close to optimal!)
2. Since I only process data within half a yearly section, I trim the annotated lines to fit each six-month period.
3. After trimming the annotations to fit neatly within the six-month period, I need to transform all these coordinates to days within a year. I know the spacing between the rows is equal. As such, the total length of a row can be calculated as the length of a line that runs from the beginning to the end of half a yearly section.
4. What remains is to calculate the positions of the start and end of an annotated segment relative to the total length, to determine the days they cover during the year.
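The four steps above can be sketched roughly as follows. This is a minimal Python illustration and not the actual processing code: the function names, the `days_in_half_year` parameter and its default are assumptions made for the example, and trimming is simplified to clamping x-coordinates after rotation (reasonable only when the annotated lines run close to parallel with the bottom axis).

```python
import math


def rotate_about(point, origin, angle):
    """Rotate `point` around `origin` by `angle` radians (counter-clockwise)."""
    px, py = point[0] - origin[0], point[1] - origin[1]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (origin[0] + px * cos_a - py * sin_a,
            origin[1] + px * sin_a + py * cos_a)


def segment_to_days(segment, axis_start, axis_end, days_in_half_year=183):
    """Convert an annotated segment (two (x, y) points) into day offsets.

    `axis_start` and `axis_end` are the bottom axis points of the half
    yearly section. `days_in_half_year` is an assumed scaling constant.
    """
    # Step 1: rotate everything around axis_start so that the bottom axis
    # lies along the x-axis.
    angle = math.atan2(axis_end[1] - axis_start[1],
                       axis_end[0] - axis_start[0])
    x_start = axis_start[0]
    x_end = rotate_about(axis_end, axis_start, -angle)[0]
    xs = sorted(rotate_about(p, axis_start, -angle)[0] for p in segment)

    # Step 2: trim the segment to the extent of the half yearly section
    # (simplified to clamping the x-coordinates).
    xs = [min(max(x, x_start), x_end) for x in xs]

    # Steps 3 and 4: express positions relative to the total row length
    # and scale them to days.
    row_length = x_end - x_start
    return tuple(round((x - x_start) / row_length * days_in_half_year)
                 for x in xs)


# Hypothetical coordinates, for illustration only.
print(segment_to_days(((25, 10), (50, 10)), (0, 0), (100, 0),
                      days_in_half_year=180))
```

Rotating around the first axis point keeps that point fixed, so the row length after rotation is simply the difference between the two x-coordinates.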

Finally, all these data will be combined into a matrix. This matrix will then be linked to the original species data, kept in a separate file.

# Scraping Tropicos data in R

Just a little web scraper to mine species distribution data from the Tropicos botanical database.

The script scrapes the first returned result, or if there is only one result it will display the available data (a data frame of all available data). If no distribution data was found the function will return an NA value.

#' @param species: genus species or genus
#' @param quiet: TRUE / FALSE, set to FALSE to print the resulting data frame
#' @keywords Tropicos, species distribution
#' @export
#' @examples
#' # with defaults, outputting a data frame with species distribution
#' # for Clematis
#' df <- tropicos.species.distribution()
#' # returns NA if no data are present
#' [requires the rvest package for post-processing]

tropicos.species.distribution <- function(species='Clematis',quiet=TRUE){

require(rvest)

species = gsub(" ", "+", species)
base_url = "http://www.tropicos.org/NameSearch.aspx?name="
url = sprintf("%s%s",base_url,species)

# load the page into memory
html_page = read_html(url)

# first check if the page lists a table of species
# or was redirected to the only search result presented
check = html_page %>%
html_nodes("#ctl00_footerControl_citationDiv") %>%
html_text()

sel = regexpr("(http://).*[0-9]",check)
final_url = regmatches(check,sel)

# if a final_url was found (a single search result) use it
# to move on, otherwise select the first listed species
if(length(final_url)>0){

# load the distribution tab of the single search result
url = sprintf("%s%s",final_url,"?tab=distribution")
html_page = read_html(url)

distribution = html_page %>%
html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
html_table() %>%
data.frame()

}else{

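# if multiple search results are present, pick the top of the list
# load the first species in the table generated by the species search
first_link = html_page %>%
html_nodes("#ctl00_MainContentPlaceHolder_nameSearchControl_gridView tr:nth-child(2) td:nth-child(3)") %>%
html_nodes("a") %>%
html_attr("href")

# clean up the string, grab the species identifier number
# (assumes the only digits in the link are the numeric name ID)
first_link = gsub("[^0-9]", "", first_link)

# format the new string, linking to the species distribution
# (assumed URL pattern, mirroring the single result case above)
url = sprintf("http://www.tropicos.org/Name/%s?tab=distribution",first_link)
html_page = read_html(url)
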
# extract the species distribution in full from the page
distribution = html_page %>%
html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
html_table() %>%
data.frame()
}

# return NA if the distribution table is empty
if(prod(dim(distribution))==0){
distribution = NA
}

# return the data frame, verbose or not
if (quiet == FALSE){
print(distribution)
}
return(distribution)
}

# Crunching Jungle Rhythms data

I haven’t blogged about Jungle Rhythms in a while. So here is a quick update on things! I’m currently working through the first batch of Jungle Rhythms data.

Although the fully annotated data is not in yet, I have partial data to work on and get an algorithm running. This algorithm will extract the phenological data as annotated by everyone who contributed and turn it into true dates (or weeks of a particular year).

Sadly, the data structure used by Zooniverse is currently less than ideal. Zooniverse data exports use a comma-separated values (CSV) file format with JSON content. However, R, in which I do most of my statistics and processing, is rather bad at dealing with JSON data. Even in Python, which handles JSON better, the data structure remains rather cumbersome. I submitted a GitHub ticket (i.e. a way to request features in software or report bugs) raising this concern, and it will be addressed in the near future.
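To give an idea of what CSV with JSON content means in practice, here is a minimal Python sketch that reads such a file and decodes the JSON held in one of its columns. The column names (`classification_id`, `annotations`) and the sample row are assumptions made up for illustration, and the real export may be laid out differently.

```python
import csv
import io
import json


def read_annotations(csv_text, json_column="annotations"):
    """Yield (row, parsed_json) pairs from CSV text whose `json_column`
    holds a JSON string. The column name is an assumption."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield row, json.loads(row[json_column])


# Made-up sample row: the JSON is stored as a quoted CSV field, with the
# inner double quotes escaped by doubling them.
sample = (
    'classification_id,annotations\n'
    '1,"[{""task"": ""T1"", ""value"": [{""x"": 10, ""y"": 20}]}]"\n'
)

for row, annotations in read_annotations(sample):
    for task in annotations:
        print(row["classification_id"], task["task"], task["value"])
```

Each row has to be decoded twice, once as CSV and once as JSON, before the nested coordinates can be flattened into a table, which is what makes the format cumbersome.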

In the meantime, I’ll still code up a processing routine to assess the intermediate results. Hopefully the new data format will make all this a bit more straightforward and transparent.

# Adventures in GPU computing and Deep Learning land

The past weeks I’ve been toying with GPU computing and Deep Learning. This is a brief summary of things I’ve learned in setting things up (hardware and software).

Let’s talk about hardware! I’m currently running everything on an MSI NVIDIA GTX 960 4GB Gaming card. Given the cost of the card (~$250) and the fact that for a lot of the Deep Learning applications memory will be the limiting factor (Google: “Check failed: error == cudaSuccess (2 vs. 0) out of memory”) rather than CUDA cores, this is a really good card. Most faster cards (more CUDA cores) will have the same amount of memory. In short, you might gain some time but it does not allow you to train more complex models (larger in size). Unless you upgrade to an NVIDIA GTX Titan X card with 12GB of memory (costing 5x more) or dedicated compute units such as a Tesla K40, you won’t see a substantial memory increase in the product line.

On the software end, the documentation on the installation of the Caffe framework is sufficient to get you started (on Ubuntu at least). The documentation is good and the community seems to be responsive, judging from the forum posts and GitHub issues. However, I’ve not filed any real troubleshooting requests (as things went well).

The only issue I encountered when setting up the Caffe-Segnet implementation was the above-mentioned “out of memory” error. Although people mentioned that they got things running on 4GB cards, I just couldn’t get it to work. In the end I realized that the memory being used to drive my displays (~500MB) might just tip the scale. Indeed, offloading graphics duties from the dedicated card onto the motherboard’s integrated graphics did free up enough memory to make things work.

An important note here is that you need to install the CUDA Toolkit and drivers without their OpenGL libraries. Otherwise the CUDA Toolkit overwrites the Ubuntu originals and breaks the graphics capabilities of your integrated graphics in the process. To install the CUDA Toolkit correctly, with the full memory of the card available, first select the integrated graphics (iGPU) as your preferred GPU in your system BIOS. Then download the CUDA Toolkit runfile and run the following command:

    sudo ./cuda_x.x.x_linux.run --no-opengl-libs

This should install all CUDA libraries and drivers while at the same time preventing the Toolkit from overwriting your iGPU OpenGL libraries.

In addition, you might also want to make sure you initiate the NVIDIA devices on boot (devices should be accessible but not driving any display) by placing the script below in your /etc/init.d/ folder.

    #!/bin/bash

    # count the NVIDIA controllers found on the PCI bus
    NVDEVS=$(lspci | grep -i NVIDIA)
    N3D=$(echo "$NVDEVS" | grep "3D controller" | wc -l)
    NVGA=$(echo "$NVDEVS" | grep "VGA compatible controller" | wc -l)

    # create a device node for each controller
    N=$(expr $N3D + $NVGA - 1)
    for i in $(seq 0 $N); do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done

    # create the control device node
    mknod -m 666 /dev/nvidiactl c 195 255


(I’m not quite sure the latter script has any use, as the devices might be initiated already anyway. However, I’d rather list it here for future reference.)

In the end, both the MIT scene recognition model and the Caffe-Segnet implementation work. Below you see the output of the Segnet Camvid demo, which uses the input of a webcam and classifies it on the fly into the 12 classes. The output is garbage as the scene does not represent a street view (the original intent of the classifier), but it shows that the system runs at a solid 210ms per classification.

# Deep learning snowy images

This past week I started to play with the Caffe deep learning framework. I initially planned on using the SegNet branch of the Caffe framework to classify snow in PhenoCam images. However, given that this is a rather binary classification, I don’t need to segment the picture (I do not care where the snow in the image is, only whether it is present). As such, a more semantic (scene-level) approach could be used.

Luckily, people at MIT had already trained a classifier, the Places-CNN, which deals with exactly this problem: characterizing an image scene. So, instead of training my own classifier I gave theirs a try. Depending on the image type, and mostly on the view angle, the results are very encouraging (even with their stock model).

For example, the image below got classified as: mountain snowy, ski slope, snowfield, valley, ski_resort. This all seems very reasonable indeed. Classifying a year’s worth of images at this site yielded an accuracy of 89% (compared to human observations).

However, when the vantage point changes so does the accuracy of the classification, mainly due to the lack of images of this sort in the original training data set, I presume. The image below was classified as: rainforest, tree farm, snowy mountain, mountain, cultivated field. As expected, the classification accuracy dropped to a mere 13%. There is still room for improvement using PhenoCam-based training data, but building upon the work by the group at MIT should make these improvements easier.
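For reference, an accuracy figure like the ones above can be computed by turning the scene labels of each image into a snow / no snow flag and comparing those flags with the human observations. The sketch below is only an illustration under stated assumptions: the keyword list, the `top_n` cut-off and the function names are hypothetical, and the decision rule actually used for the numbers above may differ.

```python
# Assumed keywords that mark a scene label as snow related.
SNOW_KEYWORDS = ("snow", "ski")


def has_snow(labels, top_n=2, keywords=SNOW_KEYWORDS):
    """Flag an image as snowy if any of its `top_n` highest ranked
    labels contains one of the keywords (an assumed decision rule)."""
    return any(keyword in label.lower()
               for label in labels[:top_n]
               for keyword in keywords)


def accuracy(predicted, observed):
    """Fraction of images where the predicted flag matches the
    human observation."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# The two label sets quoted in the text, ranked from most to least likely.
first = ["mountain snowy", "ski slope", "snowfield", "valley", "ski_resort"]
second = ["rainforest", "tree farm", "snowy mountain", "mountain",
          "cultivated field"]
print(has_snow(first), has_snow(second))
```

Note that the outcome depends on the cut-off: with all five labels considered, the second image would also be flagged as snowy because of its third-ranked label.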