Interest group about data quality, cleaning and fitness-for-use. TDWG data quality interest group.

Report of the Task Group on GBIF Data Fitness for Use in Distribution Modelling

483 days ago


The Task Group on GBIF Data Fitness for Use in Distribution Modelling was established by the GBIF Secretariat in 2015 to help improve the fitness of data for use by the distribution modelling research community.

The task group has consulted experts, gathered opinions and presented their insights on the current situation, challenges and recommendations in the report.

We welcome feedback, comments and, possibly, complementary recommendations from you here, submitted as forum posts no later than 15 May 2016. We will open the web form to collect your use-specific data quality stories shortly.

GBIF Secretariat contact: Dmitry Schigel, dschigel@gbif.org

Adam Smith
456 days ago

I found the Draft Report of the Task Group on GBIF Data Fitness for Use in Distribution Modelling spot on.  Most of it is very solid, so I'll just focus on the two exceptions.

First, although GBIF amalgamates an enormous number of original data providers' data, it should strive to be current. For example, my understanding is that since 2010 GBIF has received a copy of the TROPICOS database (the second-largest database for plants) every half year, but hasn't actually incorporated that new data into its holdings since 2010. Obviously I'm biased, being at the host institution of TROPICOS, but I do wonder if there are other similar situations.

Second, under the "Future directions" section, there is very limited mention of joining GBIF data with the large functional trait databases now coming online. The subsection on physiology addresses this in part, but only to a limited extent. I would suggest broadening this section to be more inclusive of the many kinds of traits that are being measured.

Adam Smith
Missouri Botanical Garden

Alex Thompson
456 days ago

On the topic of Mobot's data, there were long-standing issues with the old DIGIR feed that was powering the export to GBIF. Speaking as a third party who also considered using that feed, it wasn't functional enough for a full harvest (this is an endemic problem with DIGIR feeds, as data sizes grew beyond the scope DIGIR was built for).

They have recently stood up an IPT instance (http://ipt.mobot.org:8080/ipt/) which enables much more efficient consumption by GBIF and iDigBio, and the data is now current in both locations.



This transition is happening all across the GBIF datasets as their new policies come into effect, so the currency of data overall should be improving fairly significantly.

Traits are a sticky subject for iDigBio, and probably GBIF as well (although I'm not privy to their internal discussions). Most of the trait data recorded on specimens is unstructured (contained within notes, dynamic properties, or measurement extensions). Other non-specimen sources of trait data (like EOL's TraitBank) tend to aggregate information from many sources at the taxonomic level (i.e. for species, not for specimens). This is fine, in general, and matches most uses of the data, but for specimen-focused portals like GBIF and iDigBio it complicates matters. Do we index specimens based on traits provided at the species level, even if the specimen itself may not express that trait? Can we index those traits at the species level in such a way as to enable efficient trait -> species -> specimen querying, even for traits that may encompass hundreds of species?
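
To make the trait -> species -> specimen question concrete, here is a minimal sketch of a two-step lookup using in-memory indexes. It is only an illustration, not how iDigBio or GBIF index anything: the records, field names and function names are all made up, and it deliberately ignores the harder question of whether a given specimen actually expresses a species-level trait.

```python
from collections import defaultdict

# Hypothetical sample data, invented for illustration only.
species_traits = {
    "Quercus alba": {"woody", "deciduous"},
    "Quercus virginiana": {"woody", "evergreen"},
    "Trifolium repens": {"herbaceous"},
}
specimens = [
    {"id": "s1", "species": "Quercus alba"},
    {"id": "s2", "species": "Quercus virginiana"},
    {"id": "s3", "species": "Quercus virginiana"},
    {"id": "s4", "species": "Trifolium repens"},
]


def build_indexes(species_traits, specimens):
    """Build trait -> species and species -> specimen-id lookups."""
    trait_to_species = defaultdict(set)
    for species, traits in species_traits.items():
        for trait in traits:
            trait_to_species[trait].add(species)

    species_to_specimens = defaultdict(list)
    for record in specimens:
        species_to_specimens[record["species"]].append(record["id"])
    return trait_to_species, species_to_specimens


def specimens_with_trait(trait, trait_to_species, species_to_specimens):
    """Resolve a trait to specimen ids via the species level."""
    ids = []
    # Sort species names so the result order is deterministic.
    for species in sorted(trait_to_species.get(trait, ())):
        ids.extend(species_to_specimens.get(species, []))
    return ids


trait_index, specimen_index = build_indexes(species_traits, specimens)
woody_ids = specimens_with_trait("woody", trait_index, specimen_index)
```

With the sample data, a query for "woody" resolves to two species first and then to their three specimens. A trait spanning hundreds of species would work the same way, just with a larger fan-out at the species step.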

iDigBio has been collaborating with Jorrit Poelen and EOL to explore these issues outside of our main development tracks. He has produced http://www.effechecka.org/ as a simple use case, but the underlying infrastructure and logic are quite complicated. If you have specific use cases for trait data in large biodiversity portals, I'd love to hear about them.

Alex Thompson, iDigBio Software Products Lead (godfoder@acis.ufl.edu) 

452 days ago

Dear Adam and Alex,

Many thanks for joining the discussion here, and for the useful comments. I hope others will follow you in discussing the report and related ideas here. A reminder to all reading this thread that the discussion is open until 15 May.


Arthur Chapman
436 days ago

I have just now had a chance to read through this document. I find that it covers most aspects of Fitness for Use for SDM; however, I believe one aspect could have been covered in more detail, with one or more recommendations. That is the collection and storage of "Absence" data, and a recommendation to GBIF that provision be made for recording and storing that information where it can be adequately inferred. Some databases are starting to document absence data, but my understanding is that there is no adequate way for this to be linked in GBIF. This may also require modification of Darwin Core to provide for transfer of this data. There probably needs to be a comprehensive discussion on how best to collect, transfer, store, and even use absence data.

I congratulate the authors on preparing this synthesis.


435 days ago

Dear Arthur, all,

Many thanks for your feedback. In fact, publishing sample-based data using Event Core allows for publication of (pseudo?) absence information through inclusion of the taxonomic target of the survey, i.e. a list of species to be observed, of which only a portion was actually seen and counted; the rest are technically absences (in the sense of a lack of observations). So I believe this is possible.
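
As a rough sketch of that inference (an illustration only, not a GBIF or Event Core API): given the target taxon list for one sampling event and the counts actually recorded, the inferred absences are simply the target taxa without a positive count. The function name, the input shapes and the sample species are assumptions made for this example.

```python
def inferred_absences(target_taxa, observed_counts):
    """Return target taxa with no positive count in the event, sorted by name.

    target_taxa: iterable of taxon names the survey was looking for.
    observed_counts: dict mapping taxon name -> recorded count (may be 0 or None).
    """
    seen = {taxon for taxon, count in observed_counts.items() if count and count > 0}
    return sorted(set(target_taxa) - seen)


# Hypothetical sampling event: three target species, one actually counted.
target = ["Parus major", "Cyanistes caeruleus", "Sitta europaea"]
observed = {"Parus major": 4}
absences = inferred_absences(target, observed)
```

Whether such a lack of observation should be treated as a true absence depends on the survey protocol and effort, which this sketch does not model.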

Another question is how to find the absence data. There was an old, now fixed, bug where observations with a count of zero were nevertheless shown on the GBIF published maps; that is no longer the case. We don't have a filter for absences, I think, but one can play with counts and separate zeros from positive figures and from empty fields. As always, this is subject to content availability, which is controlled by publishers, and until ecologists join en masse with all their data, the output will be very shallow. Having said that, it is very important to demonstrate a well-argued demand, the data use market for absence data: what is evident and important for ecologists and distribution modellers may be seen as something minor by data holders. So thanks for bringing this up.
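
For anyone wanting to try the count-based separation on a downloaded file, a minimal sketch could look like the following. It assumes the counts arrive as raw text or numbers from the Darwin Core individualCount field; the function name and the three category labels are invented for this example, and anything non-numeric or negative is treated the same as an empty field.

```python
def classify_count(raw):
    """Classify a raw count value as 'zero', 'positive' or 'unknown'."""
    if raw is None or str(raw).strip() == "":
        return "unknown"  # empty field: no count was reported
    try:
        count = int(str(raw).strip())
    except ValueError:
        return "unknown"  # non-integer text such as "several"
    if count == 0:
        return "zero"  # candidate absence record
    if count > 0:
        return "positive"
    return "unknown"  # negative counts are not meaningful


# Hypothetical raw values as they might appear in a download.
labels = [classify_count(value) for value in ["0", "12", "", None, "several"]]
```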

This after-report miniconsultation is formally closed, but the discussion forum will remain open, so further opinions are very welcome.

Dag Endresen
435 days ago

For reporting absence data with Darwin Core, it is possible to use dwc:occurrenceStatus (http://rs.tdwg.org/dwc/terms/occurrenceStatus) with the suggested controlled values "present" and "absent". Using the Event Core and the taxon coverage to describe inferred absences (without explicitly declaring dwc:occurrenceStatus), as Dmitry mentions, could be another useful approach?
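
A minimal sketch of how a data user might act on that term, assuming occurrence records have already been loaded as dictionaries keyed by Darwin Core term names. The function name, the "unspecified" bucket and the sample records are invented for illustration; only the term name and the values "present" and "absent" come from the suggestion above.

```python
def split_by_occurrence_status(records):
    """Group records by dwc:occurrenceStatus, ignoring case and whitespace."""
    groups = {"present": [], "absent": [], "unspecified": []}
    for record in records:
        status = (record.get("occurrenceStatus") or "").strip().lower()
        key = status if status in ("present", "absent") else "unspecified"
        groups[key].append(record)
    return groups


# Hypothetical records; identifiers are made up.
records = [
    {"occurrenceID": "o1", "occurrenceStatus": "present"},
    {"occurrenceID": "o2", "occurrenceStatus": "Absent"},
    {"occurrenceID": "o3"},
]
groups = split_by_occurrence_status(records)
absent_ids = [record["occurrenceID"] for record in groups["absent"]]
```

Records without the term end up in their own bucket rather than being assumed present, which keeps the explicit and the merely implied cases apart.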