Group discussion > First sample-based dataset available in GBIF.org

Alberto González Talaván
693 days ago

Hey everyone,

I was so busy with follow-ups from GB22 that I didn't realise that the first sample-based dataset was published in GBIF.org!


Can't wait to hear about your experiences using the sample core and registering sample-based datasets in GBIF.


Anders G. Finstad
685 days ago


Thanks for noticing! This was a small dataset I put out in conjunction with a data-publishing workshop in Trondheim, Norway, that we (NINA and NTNU University Museum) organized together with the Norwegian GBIF node. The participants were mainly technicians and researchers working hands-on with ecological datasets.

First of all, I must say that I am truly thrilled by the possibilities of documenting the sampling process through the event core. As an ecologist working in the fields of species distribution modeling and population dynamics, I see this opening up GBIF as a useful arena for sharing a wide range of data types with a large scope of application. In particular, the Holy Grail is absence data, which is cumbersome to document in the occurrence core. In addition, abundance information (which needs sampling documentation to be worth anything) is much better covered.

Some issues were discovered along the way (besides my confusion about the usage of DwC terms; the specific dataset mentioned above has been updated following suggestions from Kyle Braak, thanks!), and I don’t think the event core is fully mature yet. However, I also guess that filling it in with real data is the only way to discover needs.

My first impression is that there really should be a field for explicitly documenting the taxonomic range that has been investigated in a sampling process (e.g. something like “eventTaxonomicRange”). Although this could be done in the metadata, the range could vary from event record to event record in more complex datasets. Such a field would make documenting “absence” data so much easier. Indicating “absence” in the occurrence record under occurrenceStatus is not only cumbersome when working with a large number of species, it is also often impossible (imagine a vegetation ecologist making a plot survey and identifying all vascular plants).
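To illustrate the idea, here is a minimal Python sketch of how absences could be inferred if each event carried an explicit record of the taxa that were looked for. Note that “eventTaxonomicRange” is only the hypothetical term suggested above, not an existing Darwin Core term, and the record layout and the function name infer_absences are assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple


def infer_absences(
    events: Dict[str, Set[str]],
    occurrences: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (eventID, taxon) pairs that were searched for but not recorded.

    events: eventID -> taxa investigated during that event. This stands in
        for the hypothetical eventTaxonomicRange, given here as an explicit
        set of names.
    occurrences: (eventID, scientificName) pairs for recorded presences.
    """
    # Collect the taxa actually recorded in each event.
    found: Dict[str, Set[str]] = {}
    for event_id, taxon in occurrences:
        found.setdefault(event_id, set()).add(taxon)

    # Anything in the investigated range that was not recorded is an absence.
    absences: List[Tuple[str, str]] = []
    for event_id, scope in events.items():
        for taxon in sorted(scope - found.get(event_id, set())):
            absences.append((event_id, taxon))
    return absences
```

In practice the range would more likely be given as a higher taxon (e.g. all vascular plants), in which case a reference checklist would be needed to expand it into names; the sketch sidesteps that by using explicit sets.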

Secondly, there has been confusion among our workshop participants (myself included) regarding the use of the paired terms sampleSizeValue/sampleSizeUnit. We (as in ecologists) understand sample size as something related to the population being sampled (e.g. number of observations), and the term has this distinct meaning in statistical analyses. There is also the issue that it is not presently possible to document the number of replicates (i.e. number of sampling units) when you are not using the individual replicate as the source record, which may happen for a number of reasons. I have tried using samplingEffort for this, but that apparently only accepts a string value, and storing a numerical field as a string makes reuse of the data cumbersome. In conclusion, I think keeping how users interpret the data aligned with DwC term usage is something we should strive for.
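As a small illustration of why a numeric value stored as free text is awkward for reuse, here is a Python sketch of the parsing step a data user would need before doing any arithmetic on samplingEffort. The example strings and the function name parse_sampling_effort are assumptions made for this illustration, not an agreed convention for the term.

```python
import re
from typing import Optional, Tuple

# A leading number, optional whitespace, then a unit that does not start
# with a digit, whitespace or a dot.
_EFFORT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([^\d\s.].*?)\s*$")


def parse_sampling_effort(text: str) -> Optional[Tuple[float, str]]:
    """Split a free-text effort such as '3 replicates' into (value, unit).

    Returns None when the string does not start with a number followed by
    a unit, which is exactly the case a typed numeric field would avoid.
    """
    match = _EFFORT.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)
```

A string such as "3 replicates" can be recovered this way, but "two observer-hours" cannot, so every reuser ends up writing and maintaining their own guesswork.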

We also had some discussions during the workshop about nested designs, e.g. transect-type censuses. We reached no clear conclusion about how this could be solved (elegantly) within the present structure. Maybe some extensions are needed?

Alberto González Talaván
685 days ago

Hi Anders,

Thanks for your message and for the details about your experience publishing this dataset. Much still needs to be tested and documented before we can make the publishing of sample-based datasets mainstream, which is why your experience, and that of the many others testing the new additions to the standards, is so critical at this stage.

I hope others follow the example and share their experiences in this forum! Thanks for that!