Layman’s notes on #EUCON16

This weekend, for the fourth time, the students of the Harvard Kennedy School (et al.) put together the European Conference, a conference dedicated to European politics. First of all, I should commend the students for bringing together an impressive set of speakers, among others the former President of the European Commission, José Manuel Barroso, as keynote speaker.

Below are a few observations I made as an outsider and concerned European citizen, albeit one with enough ‘little grey cells’.

On optimism and the state of the EU - First of all, Barroso’s keynote underscored the success of the EU despite the setback of the economic crisis. Irrespective of this event, and the rather grim predictions by economists worldwide, the EU grew in numbers instead of falling apart. However, he was correct to note that the current migrant crisis might pose a bigger threat to EU integration than the economic crisis did. He warned of growing xenophobia in member states, not least the more recent ones, which often don’t share the common historical context of both World Wars and, to cite Käthe Kollwitz, the general sentiment of “Nie wieder Krieg” (“never again war”, and by extension fascism). In general Barroso’s keynote was thought-provoking, yet rather positive. Given the tumultuous state of the EU, I hope Barroso was right in citing Jean Monnet: “People only accept change when they are faced with necessity, and only recognize necessity when a crisis is upon them.”

On TTIP and trade agreements - The panel discussion on the Transatlantic Trade and Investment Partnership (TTIP) provided me with some new insights as well. For one, the panel was unevenly balanced in favour of TTIP, with only Dan Mauer providing some push-back. My most “memorable” moment was the rather gratuitous cop-out by the EU ambassador to the US, David O’Sullivan, on a question regarding the transparency of TTIP. A member of the audience noted that TTIP is unprecedented in its transparency during negotiations and asked how this was perceived by the negotiating partners. Ambassador O’Sullivan responded that, indeed, the negotiations have been relatively open, if only because an initial leak forced their hand, but that this has little value as most people would only find the documents boring - as such, still no full text is provided, only legally void position papers and summaries. This rather jaw-dropping statement is not only elitist but does injustice to any democratic principle. A surprisingly cheap cop-out to a valid question, and a concern that many EU citizens share; I would have expected a more coherent response from O’Sullivan. This lack of respect for genuine concern among citizens, as well as the lackluster response of the EU in increasing transparency, is a testament to what I would call a forced hand, rather than due diligence on the part of transparency. Sadly, I fear that underhanded changes, such as those recently highlighted in the TPP, will surely make their way into TTIP without full transparency.

On privacy and Safe Harbor - In a post-Snowden age it’s clear that the US will have to start thinking about privacy as a human right. The panel seemed to agree that this is a demand of both industry and privacy NGOs, and there was consensus that this should happen in the near future, although current implementations such as Privacy Shield (Safe Harbor’s replacement) are arguably equally dead on arrival - that is, this isn’t the final solution. The main take-home message is that action will be required in the US, if not forced by the EU. Little was mentioned on how this would interface with, for example, TTIP, if at all. Yet, overall, the outcome for US citizens will only be for the better.

Anyway, back to the business of the day - modelling ecosystem responses to climate change.

 

Processing Jungle Rhythms data: intermediate results

After a few days of struggling with R code I have the first results of processed data at my fingertips! Below you see a three-panel plot showing the original image on top, the annotated image in the middle, and the final extracted data at the bottom. The middle panel gives you an idea of the geometry involved in the calculation of the final data at the bottom. I’ll briefly detail my approach, as I abandoned the idea I proposed in a previous blog post.

Using the six most common coordinates for each yearly section (red dots overlaying the individual markings, shown as green crosses, in the middle panel) I calculate the approximate location of the rows, as marked by the red dots on the vertical bold lines (a line-circle intersection). All annotations are then projected onto this ideal row (projection of a point onto a line), rendering the coloured lines for the different observation types. Finally, with the overall distance of each given row (on a half-year basis - a point-to-point distance) I calculate the location in time occupied by an annotation. Classifying the lines into life cycle event types is done by minimizing the distance to the ideal row.
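
As an aside, the point-to-line projection at the heart of this step fits in a few lines of R. Below is a minimal sketch (the function and variable names are mine, not those of the actual processing code):

# project point p onto the line through a and b (all numeric x/y vectors),
# returning the projected coordinates and the fractional position along a -> b
project_on_line <- function(p, a, b){
  ab <- b - a
  t <- sum((p - a) * ab) / sum(ab * ab) # 0 at a, 1 at b
  list(xy = a + t * ab, frac = t)
}

# example: an annotation at (3, 1.2) projected onto an ideal row from (0,0) to (10,0)
project_on_line(p = c(3, 1.2), a = c(0, 0), b = c(10, 0))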

All images are shown to ten independent citizen scientists, and for each individual annotation these values are summed. Where there is a high degree of agreement among citizen scientists I will see a larger total sum; if there is unanimous agreement the sum would be 10 for a given day of year (DOY). As not all subjects are retired yet, the values displayed below only reach a maximum count of 8. The spread around the edges of the annotations is due to variability in the classifications.
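
For illustration, turning the individual classifications into such a daily agreement count could look roughly like this in R (the data frame and column names are hypothetical):

# toy example: three users marking overlapping events, expressed as day-of-year ranges
annotations <- data.frame(doy_start = c(30, 32, 35), doy_end = c(60, 58, 61))

# for every day of year, count how many users marked that day
agreement <- sapply(1:365, function(doy){
  sum(annotations$doy_start <= doy & annotations$doy_end >= doy)
})

# plot the agreement counts, a stepped line peaking where users agree
plot(1:365, agreement, type = "s", xlab = "DOY", ylab = "count")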

You can already notice some patterns in the life cycle events. More on these patterns in a later post, when I can match the data with species names.

 

Processing Jungle Rhythms data: coordinate transformations and line intersections

I mentioned that I started working on processing some annotations into data. This will give me a feeling for the data quality, but more importantly it gets me thinking about how to process the data efficiently.

In this blog post I’ll quickly outline the methodology I’ll use to process the annotations. First, a quick summary of what the data look like once annotated and how to deconstruct the annotations into usable data.

Below you see a picture of an annotated yearly section. Red dots outline the most common intersection coordinates for the yearly section (green crosses represent all measurements), while green lines represent the annotated life cycle events within the yearly section. Note the accuracy of all the annotations, rather amazing work by everyone who contributed!

With these key locations of the yearly section (red dots), namely the start, middle and end (providing the general orientation of the yearly section within the image), the annotations can be translated into true data (compensating for skewness and warping in the picture).

Each yearly section will be processed half a year at a time, using roughly four steps:

  1. For each year I make sure the bottom axis (bottom left - bottom middle, or bottom middle - bottom right) is aligned along the Cartesian x-axis. This is done by rotating the data around the bottom left or bottom middle point, respectively (in this image the axis is already relatively close to optimal).
  2. Since I only process data within half a yearly section, I trim the annotated lines to fit each six-month period.
  3. After trimming the annotations to fit neatly within the first six months, I need to transform all these coordinates into days within a year. I know the spacing between the rows is equal; as such, the total length of a row can be calculated as the distance of a line which crosses the beginning and end of half a yearly section.
  4. What remains is to calculate the position of the start and end of an annotated segment relative to this total length, to determine which days of the year it covers (see the sketch after this list).
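
Below is a rough sketch of steps 1, 3 and 4 in R, assuming coordinates are held as simple x/y vectors and two-column matrices (all names are mine, not those of the final processing code):

# step 1: rotate points (a two-column matrix xy) around a pivot so that the
# vector pivot -> axis_end lines up with the Cartesian x-axis
rotate_to_x_axis <- function(xy, pivot, axis_end){
  v <- axis_end - pivot
  theta <- -atan2(v[2], v[1])
  R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
  sweep(xy, 2, pivot) %*% t(R)
}

# steps 3-4: convert a projected x position along a half-year row to a day of
# year, with x0 / x1 the start and end of the half-year row
position_to_doy <- function(x, x0, x1, first_half = TRUE){
  doy <- (x - x0) / (x1 - x0) * 182.5
  if (first_half) round(doy) else round(doy + 182.5)
}

# example: a segment boundary a quarter of the way into the first half year (~ DOY 46)
position_to_doy(x = 2.5, x0 = 0, x1 = 10)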

Finally, all these data will be combined into a matrix. This matrix will then be linked to the original species data, kept in a separate file.

Scraping Tropicos data in R

Just a little web scraper to mine species distribution data from the Tropicos botanical database.

The script scrapes the first returned search result or, if there is only one result, the page the search redirects to, and returns the available distribution data as a data frame. If no distribution data are found the function returns NA.

#' Scrape species distribution data from Tropicos (requires the rvest package)
#'
#' @param species genus or "genus species" string (default "Clematis")
#' @param quiet TRUE / FALSE, suppress printing of the results (default TRUE)
#' @keywords Tropicos, species distribution
#' @export
#' @examples
#' # with defaults, outputting a data frame with the species distribution
#' # for Clematis
#' df <- tropicos.species.distribution()
#' # returns NA if no distribution data are present

tropicos.species.distribution <- function(species='Clematis',quiet=TRUE){
  
  # read the required libraries
  require(rvest)
  
  species = gsub(" ", "+", species)
  base_url = "http://www.tropicos.org/NameSearch.aspx?name="
  url = sprintf("%s%s",base_url,species)
  
  # load the page into memory
  html_page = read_html(url)
  
  # first check if the page lists a table of species
  # or was redirected to the only search result presented
  check = html_page %>%
    html_nodes("#ctl00_footerControl_citationDiv") %>%
    html_text()
  
  sel = regexpr("(http://).*[0-9]",check)
  final_url = regmatches(check,sel)
  
  # if a canonical URL was found, the search redirected to a single
  # species page; use this URL to move on, otherwise select the first
  # listed species and move to its page to list the distribution
  if(length(final_url)>0){
    
    # append the distribution tab to the canonical species URL
    url = sprintf("%s%s",final_url,"?tab=distribution")
    html_page = read_html(url)
    
    distribution = html_page %>% 
      html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
      html_table() %>%
      data.frame()
    
  }else{
    
    # if multiple search results are present, pick the top of the list
    # and follow this link to the distribution data if available
    # load the first species in the table generated by the species search
    first_link_listed = html_page %>% 
      html_nodes("#ctl00_MainContentPlaceHolder_nameSearchControl_gridView tr:nth-child(2) td:nth-child(3)") %>%
      html_nodes("a") %>%
      html_attr("href")
    # clean up the string, grab the species identifier number
    first_link_listed = gsub("/Name/", "", first_link_listed, fixed = TRUE)
    
    # format the new string, linking to the species distribution
    url = sprintf("http://www.tropicos.org/Name/%s%s",first_link_listed,"?tab=distribution")
    html_page = read_html(url)
    
    # extract the species distribution in full from the page
    distribution = html_page %>% 
      html_nodes("#ctl00_MainContentPlaceHolder_nameDistributionsControl_gvwResults") %>%
      html_table() %>%
      data.frame()   
  }
  
  # if no distribution table was found, return NA
  if(prod(dim(distribution))==0){
    distribution = NA
  }
  
  # return the data frame, verbose or not
  if (quiet == FALSE){
    print(distribution)
  }
  return(distribution)
}
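
Typical usage would then look something like the calls below (the full species name is just an example):

# single species, print the distribution table as well as returning it
df <- tropicos.species.distribution("Clematis virginiana", quiet = FALSE)

# genus only, silent; returns NA if no distribution data are found
df <- tropicos.species.distribution("Clematis")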

 

Crunching Jungle Rhythms data

I haven’t blogged about Jungle Rhythms in a while. So here is a quick update on things! I’m currently working through the first batch of Jungle Rhythms data.

Although the fully annotated data are not in yet, I have partial data to work with while getting an algorithm running. This algorithm will extract the phenological data as annotated by everyone who contributed and turn them into true dates (or weeks of a particular year).

Sadly, the data structure used by Zooniverse is currently less than ideal. Zooniverse data exports use a comma-separated file format (CSV) with JSON content embedded in some columns. However, R, in which I do most of my statistics and processing, is rather bad at dealing with JSON data. Even using Python, which handles JSON better, the data structure remains rather cumbersome. I submitted a GitHub ticket (i.e. a way to request features in software or report bugs) raising this concern, and it will be addressed in the near future.
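
For illustration, reading such an export in R could look roughly like the sketch below (using the jsonlite package; the file name and the annotations column are assumptions based on a standard Zooniverse classification export):

# read the required libraries
require(jsonlite)

# read the classification export (file name is an assumption)
classifications <- read.csv("jungle-rhythms-classifications.csv",
                            stringsAsFactors = FALSE)

# each row stores its annotations as a JSON string; parse them one by one
annotation_list <- lapply(classifications$annotations, fromJSON)

# inspect the parsed annotations of the first classification
str(annotation_list[[1]])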

In the meantime, I’ll still code up a processing routine to assess the intermediate results. Hopefully the new data format will make all this a bit more straightforward and transparent.

 

 
