
Findings and possible solutions with Sample-based data publishing at SiBBr

David Valentim Dias
548 days ago

Hi everyone!

Here at SiBBr (www.sibbr.gov.br) we have been testing the new IPT, which includes the Event core and the Measurement or Facts extension, and we have some findings to share.

Our first finding, which we have already addressed in the IPT code and submitted on GitHub, concerns a dataset containing only measurements: the IPT required an Occurrence ID and a Basis of Record, which made the Occurrence extension mandatory. That doesn’t make sense because the dataset doesn’t have any occurrences at all (e.g. machine observations of CO2 levels from a data logger in a plot). We patched the IPT code to solve that, but we think that a better implementation is needed.

When a dataset has an event with a group of occurrences and measurements for each one, like annual tree measurements in a plot over several years (each yearly campaign is an event, and each tree is an occurrence that has measurements), an EventID is mandatory to link the measurements to the event in the Measurement or Facts extension, but we found there is no ID to link the occurrence to the measurement. As a workaround, we put the OccurrenceID in the Measurement ID field so we can keep the relationship between the occurrence and the measurement, but that overloads the Measurement ID field. We think that including OccurrenceID in the extension as well would solve the problem.
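To make the workaround concrete, here is a minimal Python sketch. The identifiers and DBH values are invented for illustration, and the helper `measurements_for` is hypothetical; only the field names (eventID, occurrenceID, basisOfRecord, measurementID, measurementType, measurementValue, measurementUnit) are Darwin Core terms.

```python
# Hypothetical data: one event (a yearly campaign) with two measured trees.
occurrences = [
    {"eventID": "plot1-2014", "occurrenceID": "tree-001", "basisOfRecord": "HumanObservation"},
    {"eventID": "plot1-2014", "occurrenceID": "tree-002", "basisOfRecord": "HumanObservation"},
]

# Workaround: measurementID is overloaded to carry the occurrenceID,
# because the extension row can only link to the core through eventID.
measurements = [
    {"eventID": "plot1-2014", "measurementID": "tree-001",
     "measurementType": "DBH", "measurementValue": "31.5", "measurementUnit": "cm"},
    {"eventID": "plot1-2014", "measurementID": "tree-002",
     "measurementType": "DBH", "measurementValue": "12.0", "measurementUnit": "cm"},
]


def measurements_for(occurrence_id, rows, link_field="measurementID"):
    """Return the measurement rows that point at one occurrence."""
    return [row for row in rows if row[link_field] == occurrence_id]


tree_001 = measurements_for("tree-001", measurements)
```

One visible cost in this sketch: if the same tree had a second measurement in the same event (say, height), both rows would share one measurementID, so the field could no longer identify a single measurement. With a separate occurrenceID column, the same lookup would work with `link_field="occurrenceID"` and measurementID would stay free.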

During the process of standardizing the data to DwC, we found that a single table may need to be split into up to three different tables: one for the event core, another for the occurrence extension and a third for the measurement or facts extension. Even with a team with good computer skills, making use of R, Google Refine, the Go programming language and Excel, we struggled to arrange the data in the right way, so we can’t imagine how difficult it could be for a common user. It would be great if the IPT had a way to simplify that process, like a wizard to “melt” and split the tables.
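The three-way split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wide column names (`plot`, `year`, `tree`, `species`, `dbh_cm`), the sample rows and the way identifiers are built are all invented, and the measurement rows reuse the measurementID workaround from this post.

```python
# Hypothetical wide table: one row per tree per yearly campaign.
WIDE_ROWS = [
    {"plot": "P1", "year": "2014", "tree": "T1", "species": "Araucaria angustifolia", "dbh_cm": "31.5"},
    {"plot": "P1", "year": "2014", "tree": "T2", "species": "Araucaria angustifolia", "dbh_cm": "12.0"},
    {"plot": "P1", "year": "2015", "tree": "T1", "species": "Araucaria angustifolia", "dbh_cm": "32.1"},
]


def split_wide(rows):
    """Split wide rows into event, occurrence and measurement tables."""
    events = {}  # keyed by eventID so each campaign appears only once
    occurrences = []
    measurements = []
    for row in rows:
        event_id = f"{row['plot']}-{row['year']}"       # assumed ID scheme
        occurrence_id = f"{event_id}-{row['tree']}"     # assumed ID scheme
        events.setdefault(event_id, {
            "eventID": event_id,
            "eventDate": row["year"],
            "locationID": row["plot"],
        })
        occurrences.append({
            "eventID": event_id,
            "occurrenceID": occurrence_id,
            "basisOfRecord": "HumanObservation",
            "scientificName": row["species"],
        })
        measurements.append({
            "eventID": event_id,
            "measurementID": occurrence_id,  # workaround: carries the occurrenceID
            "measurementType": "DBH",
            "measurementValue": row["dbh_cm"],
            "measurementUnit": "cm",
        })
    return list(events.values()), occurrences, measurements


events, occurrences, measurements = split_wide(WIDE_ROWS)
```

Each returned list could then be written out with `csv.DictWriter` as one file per table.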

We also think that having a way to store the original data (like verbatim.txt), or a way to merge the data back, would make the data easier for users to consume. We started a thread at TDWG about this issue here: http://lists.tdwg.org/pipermail/tdwg-content/2015-October/003569.html and there are several replies (Reverting the process of DwC standardization) here: http://lists.tdwg.org/pipermail/tdwg-content/2015-October/thread.html. There is also a similar issue open on GitHub here: https://github.com/gbif/ipt/issues/1165.
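As a rough illustration of what "merging the data back" could involve, the sketch below rebuilds one wide row per occurrence from the three tables. It is not an IPT feature: the function `merge_back`, the sample rows and the generated column name (`DBH_cm`) are hypothetical, and it assumes that measurementID carries the occurrenceID, as in the workaround described earlier in this post.

```python
# Hypothetical star-format tables for one event with one tree.
events = [{"eventID": "P1-2014", "eventDate": "2014"}]
occurrences = [
    {"eventID": "P1-2014", "occurrenceID": "P1-2014-T1", "basisOfRecord": "HumanObservation"},
]
measurements = [
    {"eventID": "P1-2014", "measurementID": "P1-2014-T1",
     "measurementType": "DBH", "measurementValue": "31.5", "measurementUnit": "cm"},
]


def merge_back(events, occurrences, measurements):
    """Rebuild one wide row per occurrence from the three tables."""
    events_by_id = {event["eventID"]: event for event in events}
    wide = {}
    for occ in occurrences:
        # Start each wide row from the parent event plus the occurrence fields.
        wide[occ["occurrenceID"]] = {**events_by_id[occ["eventID"]], **occ}
    for m in measurements:
        # Assumption: measurementID holds the occurrenceID (the workaround).
        row = wide[m["measurementID"]]
        row[f"{m['measurementType']}_{m['measurementUnit']}"] = m["measurementValue"]
    return list(wide.values())


wide_rows = merge_back(events, occurrences, measurements)
```

A real merge would also need to handle measurements that belong to the event itself rather than to an occurrence, which this sketch ignores.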

Kyle Braak
548 days ago

Dear David,

Thanks for your feedback. Let me try and address your concerns one at a time:

1. "a dataset containing only measures, required and Occurrence ID and Basis of Record, making the Occurrence Extension mandatory."

To clarify, the IPT supports publishing 4 types of resources by default:

  1. primary taxon occurrence data (using occurrence core)
  2. taxon checklists (using taxon core)
  3. sampling event data (using event core)
  4. general metadata about data sources (using no core at all)

You can add a new core to the IPT (e.g. Measurements or Facts) following these instructions https://github.com/gbif/ipt/wiki/IPT2Core.wiki

Here are the validation rules listed for your convenience:

  • A core record requires a unique identifier
  • Any occurrence record (in core or extension) requires a basis of record
  • An extension record requires an identifier to link to the core record
  • An extension record of type Occurrence does not require an occurrenceID; however, if occurrenceID is mapped, it must be unique
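The four rules above can be expressed as a small checker. This is a sketch of the rules as listed here, not the IPT's actual validation code; the function name `validate`, its parameters and the error messages are invented for illustration.

```python
def validate(core_rows, core_id_field, core_is_occurrence=False,
             ext_rows=(), ext_is_occurrence=False):
    """Return a list of rule violations for one core table and one extension."""
    errors = []
    core_ids = [row.get(core_id_field) for row in core_rows]

    # Rule 1: a core record requires a unique identifier.
    if not all(core_ids):
        errors.append("core record missing identifier")
    present = [value for value in core_ids if value]
    if len(present) != len(set(present)):
        errors.append("core identifier not unique")

    # Rule 2: any occurrence record (core or extension) requires a basis of record.
    occurrence_rows = []
    if core_is_occurrence:
        occurrence_rows += list(core_rows)
    if ext_is_occurrence:
        occurrence_rows += list(ext_rows)
    if any(not row.get("basisOfRecord") for row in occurrence_rows):
        errors.append("occurrence record missing basisOfRecord")

    # Rule 3: an extension record requires an identifier linking to the core.
    if any(not row.get(core_id_field) for row in ext_rows):
        errors.append("extension record missing link to core")

    # Rule 4: occurrenceID is optional in an Occurrence extension,
    # but if it is mapped it must be unique.
    if ext_is_occurrence:
        mapped = [row["occurrenceID"] for row in ext_rows if "occurrenceID" in row]
        if len(mapped) != len(set(mapped)):
            errors.append("occurrenceID mapped but not unique")
    return errors


# Hypothetical data: an event core with a faulty Occurrence extension.
events = [{"eventID": "e1"}]
occurrences = [
    {"eventID": "e1", "occurrenceID": "o1", "basisOfRecord": "HumanObservation"},
    {"eventID": "e1", "occurrenceID": "o1"},  # duplicate ID, no basisOfRecord
]
problems = validate(events, "eventID", ext_rows=occurrences, ext_is_occurrence=True)
event_only = validate(events, "eventID")  # an event core with no occurrences
```

Under these rules as written, an event core with no Occurrence extension produces no occurrence-related errors in this sketch.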

See the Data Validation section of the IPT User Manual for more help with validation rules: https://github.com/gbif/ipt/wiki/IPT2ManualNotes.wiki#published-versions

2. "but we found there is no an ID to link the occurrence with the measure. As a workaround, we put on Measurement ID the OccurrenceID"

OBIS addressed this very problem, by creating a custom version of the Measurements or Facts extension here: http://rs.gbif.org/sandbox/extension/obis-ExtendedMeasurementOrFact.xml

This custom extension is still in the sandbox, meaning it is only available to IPTs used in test mode.

For now, only measurements relating to the core record can be added. It is a misuse of the extension if measurement records relate to an extension record (e.g. occurrence records in this case).

3. "Will be great if the IPT have a way to simplify that process, like a wizard to “melt” and split the tables."

Your request is noted, and thank you for all the feedback you have provided in both the TDWG and IPT channels.

In the meantime, we can start documenting ways people are able to transform their data into the star format using different tools and programming languages. For example, I am successfully using Java and I will put my code on GitHub soon.

In addition, a publisher from Norway using R has made their scripts available from GitHub here: https://github.com/andersfi/DwC/blob/master/exsample_Z-sweeps.lepidurus/dwca_example_z-sweeps.R

In the cases I have dealt with so far, it seems quite common to store information about events, occurrences, locations, and measurements separately.

Perhaps you can send me some examples of files you are having difficulty splitting, so that I can better understand the problems you are facing?

Thanks once again for all your feedback, and talk to you again soon.

Kyle

David Valentim Dias
533 days ago

Dear Kyle,

Thanks for all your feedback, we appreciate it very much. I would like to reply to each issue in turn too:

1. On this finding, when I wrote about "a dataset containing only measures, required and Occurrence ID and Basis of Record, making the Occurrence Extension mandatory." I was referring to my GitHub pull request https://github.com/gbif/ipt/pull/1192, where the requirement for an OccurrenceID and BasisOfRecord was relaxed to allow publishing when a sampling event doesn't have associated occurrences. Anyway, a reminder of the basics does no harm.

 

2. We (SiBBr) had a workshop last week with datasets from vegetation plots. Most of the plots have subplots, and in each subplot there are labeled trees. The diameter at breast height (DBH) of each tree was measured once a year over several years (ranging from 3 to 19 years), so the data can be interpreted as events (going to the field and taking measurements at each subplot, each year), with occurrences (each tree) and measurements associated with the occurrences (each tree's DBH).

For us, the star format has some limitations in this scenario, and we don’t see why the extensions should not be linked using an occurrenceID. However, we would be more than happy to hear any suggestions about the approach we took, from you and from the community. To clarify, I’m attaching a file with the original table and the processed tables ready to publish.

 

Of course the OBIS extension could solve the immediate problem, but we prefer to promote the discussion here, as the whole idea is to share the data internationally using a common, agreed-upon platform. Moreover, the community could test possible changes before they are incorporated (or not) into the IPT.

3. I totally agree with you. We published on GitHub a tool that transforms data from “wide” to “long” format, replicating the “melt” command from the R package “reshape2”, using a simple web interface. It is available here: github.com/sibbr/tableconverter. We successfully used it in a workshop. We are just waiting for the translations and more detailed documentation to be ready before announcing it to the community.
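For readers unfamiliar with the wide-to-long transformation, the idea can be sketched in Python as follows. This is not the tableconverter code; the function `melt`, its parameter names, the default output column names and the sample row are this sketch's own choices, and the output row order may differ from what reshape2 produces.

```python
def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Turn each non-identifier column of a wide row into its own long row."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column in id_vars:
                continue  # identifier columns are repeated, not melted
            long_rows.append({**ids, var_name: column, value_name: value})
    return long_rows


# Hypothetical wide table: one row per tree, one column per yearly DBH.
wide = [{"tree": "T1", "dbh_2014": "31.5", "dbh_2015": "32.1"}]
long_rows = melt(wide, id_vars=["tree"])
```

Here the single wide row becomes two long rows, one per yearly measurement, which is the shape the Measurement or Facts extension expects.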

 

Thanks again for your valuable feedback and contributions. We invite the other members to post their comments as well.


David

Sample of dataset, conversion and script used.

Anders G. Finstad
527 days ago

 

Hi,

Thanks for sharing these experiences David.

I have also been working with event data lately and have had some of the same experiences as you. I hope it’s OK that I post a few comments (following the point-by-point style from above).

2. Your problem of not being able to attach measurements and facts to the occurrence extension also caused considerable headaches in a workshop we had in Norway in October. I think it’s a rather general problem, in that we often have several measurements related to individual occurrences and at the same time need to document the sampling process using the event core. One solution we discussed was to publish the data as two datasets, one for the sampling process and one for the occurrences with the individual measurements (possibly in another repository). However, this solution is both inconvenient and rather unsatisfying. The other option discussed was to use the dynamicProperties field. However, as far as I can see, this leaves us with no way to document the different measurements, and it is even more inaccessible to the non-technical user than a long-format measurement and fact table. I don’t know what technical issues would have to be solved or what the best solution would be. I am, however, confident that this is going to be a recurring problem that needs to be addressed somehow.
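To illustrate the dynamicProperties option and its drawback, here is a small Python sketch. The record, the property keys (`dbhInCentimeters`, `heightInMeters`) and the method text are invented; dynamicProperties, measurementType, measurementValue, measurementUnit and measurementMethod are Darwin Core terms, and encoding the properties as JSON is assumed here.

```python
import json

# Hypothetical occurrence with measurements packed into dynamicProperties.
occurrence = {
    "occurrenceID": "P1-2014-T1",
    "basisOfRecord": "HumanObservation",
    "dynamicProperties": json.dumps({"dbhInCentimeters": 31.5, "heightInMeters": 12.0}),
}

# A consumer has to parse the string and guess units and methods from key names.
properties = json.loads(occurrence["dynamicProperties"])

# The same DBH value as a measurement row, where type, unit and method
# each have their own documented field.
measurement_row = {
    "measurementType": "DBH",
    "measurementValue": "31.5",
    "measurementUnit": "cm",
    "measurementMethod": "diameter tape at 1.3 m",  # invented example text
}
```

The packed form keeps the value attached to the occurrence, but everything beyond the bare number has to be squeezed into the key name.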

For your specific case, and as far as I can interpret your dataset, the diameter of the trees is your primary measure. Could this perhaps be captured easily by the organismQuantity field? However, I can’t find diameter in the Quantity Type Vocabulary. I guess that would be an easy thing to fix?

3. Thanks for this, David. Converting from long to wide formats is something that users who don't write scripts do have trouble with. Such a tool should be very useful. Looking forward to updates (=

Anders