# Scraping Tropicos data in R

Just a little web scraper to mine species distribution data from the Tropicos botanical database.

The script scrapes the first returned result when a search lists several matches, or the single result the search redirects to, and returns a data frame of all available distribution data. If no distribution data are found the function returns an NA value.

#' @param species: genus species or genus
#' @param quiet: TRUE / FALSE, set to FALSE to print the result (verbose output)
#' @keywords Tropicos, species distribution
#' @export
#' @examples
#' # with defaults, outputting a data frame with species distribution
#' # for Clematis
#' df <- tropicos.species.distribution()
#' # returns NA if no data are present
#' [requires the rvest package for post-processing]

tropicos.species.distribution <- function(species='Clematis',quiet=TRUE){

require(rvest)

species = gsub(" ", "+", species)
base_url = "http://www.tropicos.org/NameSearch.aspx?name="
url = sprintf("%s%s",base_url,species)

# load the page into memory
html_page = read_html(url)

# first check if the page lists a table of species
# or was redirected to the only search result presented
check = html_page %>%
html_nodes("#ctl00_footerControl_citationDiv") %>%
html_text()

sel = regexpr("(http://).*[0-9]",check)
final_url = regmatches(check,sel)

# if a final_url was found the search was redirected to a single result,
# use it to move on, otherwise select the first listed species
if(length(final_url)>0){

# load the distribution tab of the single search result
url = sprintf("%s%s",final_url,"?tab=distribution")
html_page = read_html(url)

distribution = html_page %>%
html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
html_table() %>%
data.frame()

}else{

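# if multiple search results are present, pick the top of the list
# load the first species in the table generated by the species search
first_link = html_page %>%
html_nodes("#ctl00_MainContentPlaceHolder_nameSearchControl_gridView tr:nth-child(2) td:nth-child(3)") %>%
html_nodes("a") %>%
html_attr("href")

# clean up the string, grab the species identifier number
# (assumes the link ends in the identifier, e.g. /Name/<id>)
species_id = basename(first_link[1])

# format the new string, linking to the species distribution
# (assumes the http://www.tropicos.org/Name/<id> URL pattern)
url = sprintf("http://www.tropicos.org/Name/%s?tab=distribution",species_id)

# load the distribution page and
# extract the species distribution in full from the page
html_page = read_html(url)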
distribution = html_page %>%
html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
html_table() %>%
data.frame()
}

# an empty table means no distribution data were found
if(prod(dim(distribution))==0){
distribution = NA
}

# return the data frame, verbose or not
if (quiet == FALSE){
print(distribution)
}
return(distribution)
}

# Crunching Jungle Rhythms data

I haven’t blogged about Jungle Rhythms in a while. So here is a quick update on things! I’m currently working through the first batch of Jungle Rhythms data.

Although the fully annotated data is not in, I have partial data to work with and get an algorithm running. This algorithm will extract the phenological data as annotated by everyone who contributed and turn them into true dates (or weeks of a particular year).

Sadly, the data structure as used by Zooniverse is currently less than ideal. Zooniverse data exports use a comma-separated values (CSV) file format with JSON content. However, R, in which I do most of my statistics and processing, is rather bad at dealing with JSON data. Even using Python, which handles JSON better, the data structure remains rather cumbersome. I submitted a GitHub ticket (i.e. a way to request features in software or report bugs) raising this concern, and it will be addressed in the near future.

In the meantime, I’ll still code up a processing routine to assess the intermediate results. Hopefully the new data format will make all this a bit more straightforward and transparent.

# Adventures in GPU computing and Deep Learning land

Over the past weeks I’ve been toying with GPU computing and Deep Learning. This is a brief summary of things I’ve learned in setting things up (hardware and software).

Let’s talk about hardware! I’m currently running everything on an NVIDIA MSI GTX960 4GB Gaming card. Given the cost of the card (~$250) and the fact that for a lot of the Deep Learning applications memory will be the limiting factor (Google: “Check failed: error == cudaSuccess (2 vs. 0) out of memory”) rather than CUDA cores, this is a really good card. Most faster cards (more CUDA cores) will have the same amount of memory. In short, you might gain some time but it does not allow you to train more complex models (larger in size). Unless you upgrade to an NVIDIA GTX Titan X card with 12GB of memory (costing 5x more) or dedicated compute units such as a Tesla K40 you won’t see a substantial memory increase in the product line.

On the software end, the documentation on the installation of the Caffe framework is sufficient to get you started (on Ubuntu at least). The documentation is good and the community seems to be responsive, judging from the forum posts and GitHub issues. However, I’ve not made any real troubleshooting requests myself (as things went well).

The only issue I encountered when setting up the Caffe-Segnet implementation was the above mentioned “out of memory” error. Although people mentioned that they got things running on 4GB cards I just couldn’t get it to work. In the end I realized that the memory being used to drive my displays (~500MB) might just tip the scale. Indeed, offloading graphics duties from the dedicated card onto the motherboard’s integrated graphics did free up enough memory to make things work.

An important note here is that you need to install the CUDA Toolkit and drivers without their OpenGL libraries. The CUDA Toolkit overwrites the Ubuntu originals and breaks the graphics capabilities of your integrated graphics in the process. To install the CUDA Toolkit correctly, with full memory capabilities, first select the integrated graphics (iGPU) as your preferred GPU in your system BIOS. Then download the CUDA Toolkit installer (.run file) and run the following command:

sudo ./cuda_x.x.x_linux.run --no-opengl-libs

This should install all CUDA libraries and drivers while preventing the Toolkit from overwriting your iGPU OpenGL libraries. In addition you might also want to make sure you initiate the NVIDIA devices on boot (devices should be accessible but not driving any display) by placing the below script in your /etc/init.d/ folder.

#!/bin/bash
NVDEVS=`lspci | grep -i NVIDIA`
N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
N=`expr $N3D + $NVGA - 1`
for i in `seq 0 $N`; do
mknod -m 666 /dev/nvidia$i c 195 $i
done

mknod -m 666 /dev/nvidiactl c 195 255


(I’m not quite sure the latter script has any use as the devices might be initiated already anyway. However, I would rather list it here for future reference.)
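To check whether moving the displays to the integrated graphics actually freed memory on the dedicated card, something along these lines could be used. It is a minimal sketch, assuming `nvidia-smi` is installed with the driver; the helper name `mem_free_mb` is made up for illustration and is not part of any NVIDIA tool.

```shell
#!/bin/bash
# Hypothetical helper: compute free memory (MiB) from one "used, total"
# line as printed by the nvidia-smi query below.
mem_free_mb() {
  used=${1%%,*}    # text before the first comma
  total=${1##*,}   # text after the last comma
  echo $(( $total - $used ))
}

# Query every NVIDIA device, if the tool is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total \
             --format=csv,noheader,nounits |
  while read -r line; do
    echo "free GPU memory: $(mem_free_mb "$line") MiB"
  done
else
  echo "nvidia-smi not found, skipping GPU query"
fi
```

With the displays running off the iGPU, the used memory reported for an idle card should be close to zero.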

In the end, both the MIT scene recognition model as well as the Caffe-Segnet implementation work. Below you see the output of the Segnet Camvid demo, which uses the input of a webcam and classifies it on the fly into the 12 classes. The output is garbage as the scene does not represent a street view (the original intent of the classifier), but it shows that the system runs at a solid 210ms per classification.

# Deep learning snowy images

Last week I started to play with the Caffe deep learning framework. I initially planned on using the SegNet branch of the Caffe framework to classify snow in PhenoCam images. However, given that this is a rather binary classification I don’t need to segment the picture (I do not care where the snow in the image is, only whether it is present). As such, an approach that classifies the scene as a whole could be used.

Luckily people at MIT had already trained a classifier, the Places-CNN, which deals with exactly this problem: characterizing an image scene. So, instead of training my own classifier I gave theirs a try. Depending on the image type, and mostly the view angle, the results are very encouraging (even with their stock model).

For example, the below image got classified as: mountain snowy, ski slope, snowfield, valley, ski_resort. This all seems very reasonable indeed. Classifying a year’s worth of images at this site yielded an accuracy of 89% (compared to human observations).

However, when the vantage point changes so does the accuracy of the classification, mainly due to the lack of images of this sort in the original training data set, I presume. The image below was classified as: rainforest, tree farm, snowy mountain, mountain, cultivated field. As expected, the classification accuracy dropped to a mere 13%. There is still room for improvement using PhenoCam based training data, but building upon the work by the group at MIT should make these improvements easier.
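How the scene labels get reduced to a snow / no snow call is not spelled out above, so the following is only a sketch of one possible rule: look at the first k labels returned by the classifier and flag snow when any of them contains "snow" or "ski". The function name `snow_label`, the keyword list and the default of k = 1 are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical rule: flag an image as "snow" when one of the first k
# comma-separated scene labels contains "snow" or "ski".
snow_label() {
  labels=$1
  k=${2:-1}   # number of top labels to inspect (assumed default: 1)
  # keep the first k comma-separated labels
  top=$(printf '%s\n' "$labels" | cut -d',' -f"1-$k")
  case "$top" in
    *snow*|*ski*) echo "snow" ;;
    *) echo "no snow" ;;
  esac
}

# the two label lists quoted in the text
snow_label "mountain snowy, ski slope, snowfield, valley, ski_resort"
snow_label "rainforest, tree farm, snowy mountain, mountain, cultivated field"
snow_label "rainforest, tree farm, snowy mountain, mountain, cultivated field" 5
```

The second and third calls differ only in k: with the top label alone the second image counts as snow free, while looking at all five labels picks up "snowy mountain". Which cut-off matches the human observations best would have to be tested per site.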

# Odyssey caffe-SegNet installation instructions

Here I provide a simple set of bash commands and settings to get started with the caffe-SegNet tutorial on the Harvard Odyssey cluster and its NVIDIA CUDA capabilities. If the below setup works you can move on and start processing your own data.

# SegNet required modules and
# interdependencies


source ~/.bashrc

Start a GPU session on the cluster

srun --pty --gres=gpu -p gpu -t 600 --mem 8000 /bin/bash

# clone the tutorial data and rename the directory
git clone https://github.com/alexgkendall/SegNet-Tutorial.git
mv SegNet-Tutorial Segnet

# move into the new directory
cd Segnet

# clone the caffe-segnet code
git clone https://github.com/alexgkendall/caffe-segnet.git

# download a replacement FindAtlas.cmake into the cmake modules directory
wget -q "https://www.dropbox.com/s/hbhzl2bwm19vtd0/FindAtlas.cmake?dl=0" -O ./caffe-segnet/cmake/Modules/FindAtlas.cmake

# create the build directory
mkdir ./caffe-segnet/build

# move into the build directory
cd ./caffe-segnet/build

# create compilation instructions
cmake -DCMAKE_INCLUDE_PATH:STRING="$CUDNN_INCLUDE;$BOOST_INCLUDE;$GFLAGS_INCLUDE;$HDF5_INCLUDE;$GLOG_INCLUDE;$PROTOBUF_INCLUDE;$SNAPPY_INCLUDE;$LMDB_INCLUDE;$LEVELDB_INCLUDE;$ATLAS_INCLUDE;$PYTHON_INCLUDE" -DCMAKE_LIBRARY_PATH:STRING="$BOOST_LIB;$GFLAGS_LIB;$HDF5_LIB;$GLOG_LIB;$PROTOBUF_LIB;$SNAPPY_LIB;$LMDB_LIB;$LEVELDB_LIB;$ATLAS_LIB;$CUDNN_LIB;$PYTHON_LIB" ..

# compile and test all code
make all
make runtest
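The cmake call above relies on a long list of environment variables (`$CUDNN_INCLUDE`, `$BOOST_LIB`, ...) which are presumably set when the required modules are loaded. If one of them is empty cmake will quietly search the wrong paths, so a quick check before running cmake might save some time. This is a minimal sketch; the helper name `missing_vars` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every listed variable
# that is unset or empty.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# variable names as used in the cmake call
missing=$(missing_vars \
  CUDNN_INCLUDE BOOST_INCLUDE GFLAGS_INCLUDE HDF5_INCLUDE GLOG_INCLUDE \
  PROTOBUF_INCLUDE SNAPPY_INCLUDE LMDB_INCLUDE LEVELDB_INCLUDE \
  ATLAS_INCLUDE PYTHON_INCLUDE \
  BOOST_LIB GFLAGS_LIB HDF5_LIB GLOG_LIB PROTOBUF_LIB SNAPPY_LIB \
  LMDB_LIB LEVELDB_LIB ATLAS_LIB CUDNN_LIB PYTHON_LIB)

if [ -n "$missing" ]; then
  echo "unset variables:"
  echo "$missing"
else
  echo "all cmake path variables are set"
fi
```

Any name that gets listed points to a module that was not loaded in the current session.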