# Jungle Rhythms user statistics: location

I needed to get a grasp of where different Jungle Rhythms users come from for a grant I’m writing. Zooniverse kindly provided me with country-based summary statistics, giving me some insight into user activity per region.

Currently, most users come from English-speaking countries, with the highest number of sessions from the US (47%) and the UK (17%). High session numbers are also noted for Canada and Australia (both ~3.5%). Surprisingly, Belgium scores really well, accounting for 4% of user sessions (potentially due to my own activity when I’m in Belgium, and to ElizabethB, who has been very active on the project). The remaining top-10 countries, France, Germany, the Netherlands and Poland, account for another 10% of all sessions. All EU countries combined account for ~31% of sessions, indicating that the Jungle Rhythms user pool is largely split between the US and the EU (with a strong focus on the UK). Below you find a barplot of these statistics.

[caption id="attachment_1274" align="aligncenter" width="768"] Barplot showing the origin of user sessions in %; only the top 10 countries are listed.[/caption]

# MCD10A1: a combined MODIS aqua / terra snow cover product

For an ongoing project I had to calculate fractional snow cover transition dates across Alaska. These transition dates mark when snow cover drops below 5% or when continuous accumulation of more than 5% starts. The period between these two dates can be considered snow free, and has the potential to support vegetation growth.
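The threshold logic itself is straightforward to script. Below a minimal sketch in R, assuming a hypothetical daily snow cover vector (snow, in %) and matching days of year (doy) for a single pixel, with one contiguous snow-free season:

# snow: hypothetical daily fractional snow cover (%), doy: matching day of year
snow_free = which(snow < 5)

# the first day below 5% cover marks the snow melt date
melt_date = doy[min(snow_free)]

# the first day of renewed accumulation above 5% marks the end of the season
accumulation_date = doy[max(snow_free) + 1]

# the period in between can be considered snow free
snow_free_period = accumulation_date - melt_date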

For this exercise I used my usual set of tools and data streams (R + local MODIS HDF tiles). Sadly, this processing chain took forever. Given the rather simple nature of the code I decided to turn to Google Earth Engine (GEE) and rewrote the analysis into GEE javascript. Below you find the resulting MCD10A1 product, a combined MODIS Aqua / Terra snow cover product. The original MODIS products (MOD10A1/MYD10A1) as processed by the National Snow and Ice Data Center have the tendency to be biased low. Hence, my solution fuses the two data streams using a maximum value approach across the two data streams. Furthermore, the script is written to loop over all years with sufficient data (2003 - current), and collects the above-mentioned transition dates. Finally, the script runs a simple regression analysis to mark trends in snow melt dates, answering the question whether snow melt occurs earlier or later with each passing year.
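The actual analysis runs as GEE javascript, but the core fusion and trend logic is simple enough to sketch in R as well (the file names and the melt date vector below are hypothetical):

# read matching daily Terra and Aqua snow cover layers (hypothetical files)
mod = raster("MOD10A1_snow.tif") # Terra
myd = raster("MYD10A1_snow.tif") # Aqua

# fuse the two data streams using a per-pixel maximum,
# countering the low bias of the individual products
mcd = max(mod, myd)

# trend analysis: regress yearly melt dates (melt_dates, a hypothetical
# vector of days of year) on year; a negative slope marks earlier melt
years = 2003:2016
slope = coef(lm(melt_dates ~ years))[2]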

Below you see a snow cover trend map of the US and Canada, with earlier snow melt marked in red (-) and later snow melt marked in blue (+). Some patterns stand out: mainly mountain ranges in the west of the US have seen earlier and earlier snow melt dates. Alaska doesn’t show very clear patterns, aside from earlier snow melt in the southern peninsula. Arctic Canada does see clear earlier snow melt throughout. Surprisingly, some later snow melt dates can be noted as well. From the middle of Alberta, Canada, sweeping east to the Great Lakes, a trend of later snow melt dates is noted. A word of caution is needed, as these regions might be heavily influenced by the “polar vortex” of 2015, creating a long winter season and potentially skewing the trend analysis. Furthermore, locations without data are places with a very short snow cover period, which were excluded for clarity (i.e. these are areas where snow cover has less of an influence on the growing season).

In short, as long as the analysis isn’t too complicated GEE is a quick way to get good and reproducible results. Sadly, the system does have its limitations and makes true time series analysis (with custom aggregation and curve fitting techniques) rather difficult if not impossible. I feel that these kinds of analyses are clearly outside the scope of the more GIS-oriented GEE setup. I tip my hat to Google for providing this service; rather impressive, I must say.

# MODIS HDF data extraction in R

The Moderate-resolution Imaging Spectroradiometer (MODIS) sensors on both the Aqua and Terra platforms provide a wealth of (environmental) data. However, manipulating the raw files can be challenging. Although Google Earth Engine provides an easier way to access these data, as most of the MODIS products are hosted there, sometimes direct manipulation is still necessary. Here I quickly outline how to extract data from raw MODIS files using R.

## Dependencies

To download and extract data from a MODIS HDF file in R you need a few packages, mainly: raster, sp and MODIS.

After installing these packages, make sure you load them to continue.

require(raster)
require(sp)
require(MODIS)


## Finding your way on the globe

First you have to determine where your data are located within the tiled format in which MODIS data are distributed. All tiles are denoted with a horizontal (h) and a vertical (v) index. For the land products there are roughly 350 tiles with varying degrees of actual land coverage. All MODIS data are distributed in a custom sinusoidal projection, hence the rather weird-looking shape of the map.

### Tiles

In order to find the right tile we load a coordinate into R. In this case it’s the location of Harvard Yard: latitude 42.375028, longitude -71.116493. Note that spatial objects in R expect coordinates in x-y (longitude-latitude) order.

# the location of Harvard Yard
# (x/y order: longitude first, latitude second)
harvard_yard = cbind(-71.116493, 42.375028)

# define the projection system to use
# lat-long in this case
latlon = CRS('+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0')

# create a spatial object from the coordinate(s)
coordinates = SpatialPoints(harvard_yard, latlon)

# get the horizontal and vertical tile location for the coordinate(s)
# get the horizontal and vertical tile location for the coordinate(s)
# (here, tiles is the MODIS sinusoidal tile grid, e.g. loaded from a shapefile)
horizontal = extract(tiles, coordinates)$h[1]
vertical = extract(tiles, coordinates)$v[1]
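Alternatively, should no tile grid file be at hand, the h/v indices can be computed directly from the sinusoidal map geometry. A sketch, using the fact that the MODIS grid counts 36 columns and 18 rows of ~1111950 m tiles on a sphere of radius 6371007.181 m:

# sinusoidal forward projection of the point
R = 6371007.181
lat = 42.375028 * pi / 180
lon = -71.116493 * pi / 180
x = R * lon * cos(lat)
y = R * lat

# convert map coordinates to tile indices
tile_size = 2 * pi * R / 36
h = floor((x + 18 * tile_size) / tile_size) # h = 12
v = floor((9 * tile_size - y) / tile_size)  # v = 4

For Harvard Yard this yields tile h12v04.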


Once you have the horizontal and vertical tiles you can download only these tiles instead of the entire archive.
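The MODIS package can handle such a targeted download; a sketch, where the product and date range are example values:

# download a single product / tile combination
# (product and dates are examples)
hdf_files = getHdf(product = "MOD10A1",
                   begin = "2010.01.01",
                   end = "2010.12.31",
                   tileH = horizontal,
                   tileV = vertical)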

### Extracting data

As previously mentioned, the data are stored in HDF files, which are readily readable by the GDAL library. However, given the structured nature of these files, you have to know what you are looking for. The layers in structured HDF files are called scientific data sets (SDS), and you need to specify one to read it individually.

# read in all SDS strings (layer descriptors)
# using the MODIS package
sds <- getSds(filename)

# grab layer 4 of a particular HDF file
my_hdf_layer = raster(readGDAL(sds$SDS4gdal[4], as.is = TRUE))

# should the HDF file not contain multiple layers, but only a single one,
# you can default to using raster() without any additional arguments
my_hdf_file = raster(filename)


Now you have successfully read in the data. However, these data are projected in the MODIS sinusoidal projection, not lat-long. In order to extract data at given locations, you have to translate the coordinates of these extraction points to the sinusoidal projection as well.

# specify the sinusoidal projection
sinus = CRS("+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs")

# transform the coordinates
coordinates_sinus = spTransform(coordinates,sinus)


Finally extract the data at the new sinusoidal coordinates.

# extract data
my_data_point = extract(my_hdf_layer,coordinates_sinus)


# Jungle Rhythms: keeping score

For those keeping score: the project is 4 months in, just passed the 1/3 mark of retired subjects, and is rapidly approaching the halfway point in terms of overall work done (number of classifications). Cheers to everyone who contributed!!!

# open access != easy access

Over the past few years, writing software left and right, I have noticed a trend. Most of the software I write serves one purpose: making open access data accessible! This should not be!

There has been a steady push for open data, increasing transparency and reproducibility of scientific reporting. Although many scientific data sources are indeed open, their access is not (easy), especially for the less computer savvy.

For example, NASA provides a wealth of open data. Although there are a few tools on which one can rely, none are user friendly. Luckily, most of those can be replaced by open source tools coded by data users (ModisTools, MODIS LSP, GDAL). The European Space Agency (ESA) fares even worse: their data access is more restrictive and accessing the data is an equal or even bigger mess. However, ESA has seen a recent push for a more user-centric experience on some of its projects.

Looking at the field of ecology, some projects do well and maintain APIs, such as Daymet and Tropicos. Although Tropicos requires a personal key, which makes writing toolboxes cumbersome. The latter left me with no choice but to scrape the website. The Oak Ridge National Laboratories (ORNL) also offers MODIS land product subsets through an API, with interfaces coded by users. These data are truly open access (as in relatively easy to query) and should be considered the way forward.

This contrasts with, for example, resources such as The Plant List, which offers a wealth of botanical knowledge guarded behind a search box on a web page, only to be resolved by either downloading the whole database or by using a third-party website. Similarly, the National Snow and Ice Data Center’s oldest snow and ice data are stored in an incomprehensible format (the more recent data are offered in an accessible geotiff format). Surprisingly, even large projects such as Ameriflux, with a rather prominent internet presence, suffer the same fate, i.e. a wealth of data largely inaccessible for quick and easy use and review.

Pooling several data access issues in the above examples, I think I’ve illustrated that open access data does not equal easily accessible data. A good few of us write and maintain toolboxes to alleviate these problems for ourselves and the community at large. However, these efforts take up valuable time and resources and can’t be academically remunerated, as only a handful of tools would qualify as substantial enough to publish.

I therefore plead that data producers (and projects alike) make their open data easily accessible by:

1. creating proper APIs to access all data or metadata (for querying)
2. making APIs truly open so writing plugins can be easy if you don't do it yourself
3. writing toolboxes that do not rely on proprietary software (e.g. Matlab)
4. assigning a true budget to these tasks
5. appreciating those that develop tools on the side