
Feedback on published data-workflow for improving the quality of published data

Aaike De Wever
2116 days ago

Although this is maybe not strictly a data publishing topic, I am using the opportunity to ask it here...

When talking to scientists, one of their main objections to using and providing data is that the quality of the data doesn't meet their expectations (and so they don't see the advantage of adding their own data). Although part of this criticism may be due to the fact that they don't fully understand how the GBIF data network works and what the origin of specific issues may be, it is clear that there is indeed quite some room for improvement when it comes to data quality.

I believe that this could, in part, be remedied by (a) providing tools and flags to alert data users to potential issues and (b) providing feedback mechanisms that allow users to comment on specific datasets and records and help the data provider correct errors where possible/necessary.

For (a) I am aware that GBIF is already doing some 'flagging', e.g. of records with a mismatch between coordinates and country info, but for the sake of this discussion I would like to focus mostly on (b). I wonder if there are any developments in the pipeline for this, and whether this is something that is handled well in other biodiversity data initiatives or at the level of certain nodes?

I am interested to learn your opinion on this.

Cheers,  Aaike

David Remsen
2115 days ago

Aaike - I know that in the new portal roadmap there will be a focus on supporting the sort of annotation capability you refer to in (b). I'll see if Tim R is able to provide details here, but I agree wholeheartedly that this would be a very useful feature.

In regard to (a), I am in favor of developing services that provide data evaluation processes to alert data publishers to potential issues before the data is actually indexed in the portal. As more data moves to DwC-A, I could envision a simple service that reads the archive, runs it through various assessments, and makes the publisher aware of any issues raised, with possible remediation suggestions.

Burke Chih-Jen Ko
2115 days ago

Hi Aaike,

We are indeed working on this, and if things go well we should see some implementations go public in the coming year. To address (b), the new portal development includes an annotation mechanism through which discussions can be associated with a particular dataset, in addition to the email feedback to the data publisher which is already functioning in the current data portal. This will make potential issues in the data publicly visible, so users are aware of them.

The other side of improvement is, as you just mentioned, understanding how the GBIF network works; it is our documentation and communication that should be further organised so that the structure and mechanisms stand out.

Quality itself is tied to what data publishers provide, so what GBIFS can do is facilitate improvement at the source: (a) is an example, and (b) is a short-term solution on the portal side. For the long run, the first data paper has just been published in the latest issue of ZooKeys, and from that we see that data need to go through a quality-control check before they are published. Academic credit is the incentive here, and I hope this starts to raise the overall data quality across the network. Then, in my humble opinion, when data citation and usage mechanisms are also in place, data publishers will need to be proactive about quality improvements rather than just waiting for feedback. In this respect there is truly a lot of room for us all to work on, but progress has started.

Cheers,

Burke

Nicolas Noé
2115 days ago

Very glad to hear we'll see something soon about these annotations; I'm convinced it's a very important step towards convincing more people of GBIF's usefulness.

But I have the feeling that we will also need something else to get the most out of these annotations: a versioning mechanism. It's good to know that some names are outdated in a dataset, but it would be better to know that someone said they were outdated at revision 13 of the dataset, but that this was greatly improved at revision 15.

And (I'm moving dangerously off-topic, but it's all related) maybe another prerequisite is some very strong and stable identifiers for all entities in the GBIF network. For example, to clearly reflect that an occurrence appearing in 3 different datasets (republication through aggregators) is in fact the same record, but at different revisions and living on different, parallel branches (to use SVN vocabulary, sorry for the non-IT people).

Are these topics of interest to the Secretariat too?

thanks in advance,

nico

Burke Chih-Jen Ko
2113 days ago

How about "git", Nico? I can imagine it would be an innovative application like Time Machine does with hard links, although this is at the file level.

I think it is relevant to the quality issue, and you are talking about how we should support it at the implementation level, right? I think, as you suggest, versioning will need to be considered before we release the annotation feature, or soon the annotations will become disconnected from the dataset. We are just starting to cope with the data volume with the new processing workflow, so there is now room for us to experiment and move forward.

A possible idea could be bidirectional feeding of annotations between the portal and data-publishing endpoints like the IPT: when an annotation is made via the portal, the portal could push the annotation to become part of the resource on the IPT, so the publisher is aware of it and can track whether the issues are tackled. I'm not sure whether wrappers like TAPIR or BioCASe allow versioning locally?

Yes please brainstorm these ideas with us.

Burke

Alan Williams
2113 days ago

DataCite could be a good resource to look at for the identification of data and the specification of metadata.

Alan

Dag Endresen
2113 days ago

Stable identifiers are truly useful when they are assigned as close as possible to the data owner/source/publisher. However, if the GBIF data portal is becoming able to recognize the same record between each reindexing of the same dataset (portal roll-over), which would be required to link an annotation to the newly reindexed version of the record, then perhaps a PID (persistent identifier) for this portal record, exposed publicly, might be a useful service for other participants in the GBIF network as well...? Any thoughts on this?

Rod Page
2113 days ago

+1 for Dag's comment about stable identifiers. I've argued elsewhere (e.g., https://plus.google.com/104270512787685766370/posts/i9F8KcHfsS1 and http://iphylo.blogspot.com/2009/04/gbif-and-handles-admitting-that.html ) that the lack of stable identifiers for museum specimens is a major hindrance to linking stuff together and supporting annotations (such as "this specimen has been cited in these papers" and "this specimen is the source for these sequences"). I think this is the major weakness of GBIF, which is ideally placed to provide these identifiers. Part of the problem seems to be the politics of aggregation, but as I've discussed in the posts linked to above, there are ways around this. In other words, there are mechanisms that would allow GBIF to serve persistent identifiers without too much "branding", and a mechanism to hand those identifiers off to the primary providers if and when they have the infrastructure to do this.

Steve Wilkinson
2113 days ago

Hi Dag
I think the issue here fundamentally revolves around how good the portal can be at spotting duplicate records (e.g. from a dataset that is periodically re-published). Within the UK:

  • We have a large number of data providers that don't have any sort of persistent key for their records. So, when they republish their records we drop the whole dataset and reimport it. This causes problems when we publish to GBIF.
  • The only way we could create an internal persistent key would be to process the dataset on load and attempt to identify each unique record from its attributes (i.e. trying to spot the same record by matching on position, species, date etc.). This is a massive processing overhead, and it just won't work on our current infrastructure. It might be more feasible for GBIF - I don't know.
  • BUT - it also creates problems for us, especially with regard to annotations. If a user flags a record as bad (we have a facility to allow this), the flag gets tagged against our internal key. As this key is not persistent, the annotation is lost when the dataset is next uploaded.
  • One possible way around it that we have thought about but not implemented is to use the record attributes themselves (rather than the key) as the mechanism for flagging the annotations. So let's say a user flags a record as dubious. The system actually takes the attributes that make up the record (date, place, species etc.) and uses this information to identify the dubious record. The real advantage of this (for us at least) is that the same dubious record has a habit of popping up from multiple places - e.g. someone publishes it in the literature, then lots of people digitise it and publish it electronically. I imagine GBIF has similar issues with duplication - e.g. data published through OBIS and directly to GBIF.
  • Using the attributes of a record flagged as dubious, rather than its key, as the way of identifying it brings a couple of advantages:
  • you don't have to worry so much about persistent identifiers
  • if the record appears from multiple sources (different identifiers) it doesn't matter
  • if the original record is deleted, it doesn't matter - you still hold the annotation, so if it later pops up again it is still flagged
  • etc.

Not sure how feasible this is elsewhere, or indeed how widespread this problem is elsewhere, but I thought it was worth sharing our thinking.

Steve

Dag Endresen
2113 days ago

Thanks Rod for the 2009 blog post! +1

As Steve points out, many of the country Nodes and other network aggregators providing data to GBIF may have problems linking an annotation (or any other linked information resource) back to the actual dataset record. If the annotation were linked to a PID that the GBIF portal can resolve, then Steve could retrieve those attributes that are indexed by GBIF and that are in some situations required to recognize the actual data record... Would not such a mechanism be better than (or at least no worse than) eventually storing the data record together with the annotation?

Annotations and comments collected by the UK portal might not easily be associated with such a centrally assigned PID...? But even if PIDs provided by a central portal do not solve all needs, they could perhaps be valuable for many other purposes - and perhaps much better than no PIDs for dataset records at all...? And with Rod's suggestion on how they can be more easily transferred towards the data creator/source/publisher, we could see such PIDs gaining value if recognized (and served) by more of the stakeholders.

Rod Page
2113 days ago

Personally I don't particularly care who serves the identifiers, I just want them. In terms of demonstrating value, one relatively quick way to do this would be to mine GenBank for museum identifiers and do some stats on how many specimens have a digital presence, how many times individual specimens have been "cited" (that is, sequenced), and what institutions they come from. This was part of the motivation behind http://iphylo.org/~rpage/challenge/www/ , which demonstrated linking literature, names, sequences and specimens together.

Personally I wish GBIF would step up and become the CrossRef of biodiversity data.

Peter Desmet
2113 days ago

Steve Wilkinson's post gave me an idea. Let me know if it is complete nonsense.

Let's say you make a UUID or hash based on some or all fields of an occurrence record (specimen/observation), e.g. the locality, collector, date and scientific name. As a community we use the same hash function and the same fields to generate these UUIDs. Then:

  • Anyone could create UUIDs for records
  • An aggregator and a publisher do not have to share their UUIDs. The same record will generate the same UUID.
  • If I drop my dataset and upload a new one, I can create the same UUID for records that were not changed in the fields on which the UUID is based.
  • I can compare UUIDs to find duplicates within and across datasets.
  • I can annotate a record via its UUID. This annotation will stay with the record as long as it has not changed in the fields on which the UUID is based. Although not perfect, this makes sense if I'm annotating the record for one of the fields on which the UUID is based (e.g. a record should only stay flagged as having an incorrect scientific name until the record has changed).

These UUIDs are not persistent though (so they wouldn't help for citing records), but they could help to find duplicates or to annotate records for a period of time.
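A rough sketch in Python of what I mean (the choice of fields and the normalisation are just an example of something the community would have to agree on):

```python
import uuid

# Fields the community would have to agree on (hypothetical choice).
HASH_FIELDS = ("locality", "recordedBy", "eventDate", "scientificName")

def normalise(value):
    # Collapse whitespace and lower-case, so trivial formatting
    # differences do not change the identifier.
    return " ".join(str(value or "").split()).lower()

def record_uuid(record):
    """Deterministic UUID derived from the agreed fields of a record."""
    canonical = "|".join(normalise(record.get(f)) for f in HASH_FIELDS)
    # uuid5 turns the canonical string into a stable, UUID-shaped value.
    return uuid.uuid5(uuid.NAMESPACE_URL, canonical)

record = {
    "locality": "Sonian Forest",          # made-up example values
    "recordedBy": "P. Desmet",
    "eventDate": "2011-06-14",
    "scientificName": "Fagus sylvatica",
}
print(record_uuid(record))  # same field values -> same UUID, anywhere
```

Two parties computing this on the same field values arrive at the same identifier without ever exchanging keys; the flip side is that any change to one of the hashed fields produces a new identifier, which is exactly the behaviour described above.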

Claude Nozeres
2113 days ago

Aaike: these are really important topics for data publishers and I am glad for the discussion. Burke & Nicolas: calls for better quality from providers need to be bolstered by signs that GBIF is useful for indicating potential problems and for versioning records. Are there recommended tools for managing databases at the source level? (RE: Dag on PIDs.)

For example, I am intrigued by the recently released Specify 6.4, which has a write-back module ('Scatter Gather Reconcile') to take records from GBIF and populate the local database. Is this standard practice or actually completely new? It seems to me that it could revolutionize the use of GBIF data: one can contribute, but now also identify already-published records, thus avoiding duplicates, which is a major quality issue with GBIF at present, as has been suggested for museum data. Next, one can push these records to local databases in biodiversity labs. Versioning, quality flags, and identifying duplicates would help enormously with the maintenance of local databases in our labs. In turn, this could ensure we publish more quality data, more often, to the IPT.

 http://specifysoftware.org/content/specify-64

 I am curious if anyone has used this 'SGR' and how it deals with stable identifiers and aggregation, RE: Dag & Rod.

 +1 Rod. It seems like the IPT and DwC-A are current examples of successful data publishing tools and standards, but we also need a group for referencing records. If I understand Rod's comment, perhaps it is not enough to have a persistent identifier standard like 'DOI'; we also need a body to actively manage such PIDs, as 'CrossRef' reliably does for publications.

Rod Page
2113 days ago

DOIs (or any other identifier) are only part of the story; the other component is services. CrossRef works because:

1. There are stable identifiers for publications (in this case DOIs)

2. These identifiers can be resolved to human-readable content (an article) or computer-readable metadata (various flavours of XML)

3. Given metadata about a paper I can find a DOI (the equivalent for a specimen would be discovering the identifier from the museum code). This is how publishers convert the articles we cite at the back of our papers into clickable links

4. If a DOI breaks, CrossRef can fix it (or at least you can complain to them about it)

GBIF is ideally placed to provide these services because it has large amounts of specimen data cached, so if a primary provider falls over, GBIF could serve the metadata.
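To make (2) concrete: the same DOI can be dereferenced either to the article page or to structured metadata, simply by content negotiation. A minimal sketch in Python (the DOI shown is a placeholder; substitute any real CrossRef DOI):

```python
import requests

DOI = "10.1234/example-doi"  # placeholder; substitute a real CrossRef DOI

# Human-readable: following the DOI redirects to the publisher's article page.
landing = requests.get(f"https://doi.org/{DOI}", allow_redirects=True)
print(landing.url)

# Machine-readable: the same identifier, resolved with an Accept header,
# returns structured metadata (CSL JSON) instead of HTML.
meta = requests.get(
    f"https://doi.org/{DOI}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
print(meta.json().get("title"))
```

An equivalent service for specimens would let the same specimen identifier resolve to a landing page for humans and to Darwin Core metadata for machines.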

Dag Endresen
2113 days ago

@Peter, I believe that the European Genebank portal (EURISCO) was and is using a similar approach: an MD5 hash of a triplet of values for specimen records. Here the triplet is the ID (centrally managed) of the institute holding the specimen or genebank accession (1), the genus name (2) and the accession number (3), and this is the input for the MD5 hash. The hash string is used to identify individual records between roll-overs (reindexing of datasets) for this genebank portal database. A similar approach was also used earlier for the European Barley Database. And GBIF also identifies individual records based on a similar triplet of attribute values. However, different types of resources have different stable (or relatively stable) attribute value combinations. There might e.g. be biodiversity resources that are not part of a collection maintained by an institute, and possibly resources that do not have a genus name(?). So each data type would need a different protocol to generate the MD5 hash. However, in lack of something else, such an approach has been demonstrated to work.
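For illustration, a small sketch of how such a triplet hash could look (the exact normalisation EURISCO applies is an assumption on my part, and the example values are made up):

```python
import hashlib

def triplet_hash(institute_id, genus, accession_number):
    """MD5 over the (institute, genus, accession number) triplet,
    roughly as described above; the normalisation is only a guess."""
    triplet = "|".join(s.strip().upper() for s in
                       (institute_id, genus, accession_number))
    return hashlib.md5(triplet.encode("utf-8")).hexdigest()

print(triplet_hash("NGB", "Hordeum", "12345"))  # made-up example values
```

As long as the three input values stay the same between roll-overs, the hash (and thus the internal record identity) stays the same.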

Nicolas Noé
2113 days ago

Very interesting and rich discussions happening here. We're really off-topic now, but I'm glad this is taking place anyway. I have to admit I need to take a few more hours to think slowly about all this before being able to offer a useful point of view at this level of technical detail. So, for now I will just share some general opinions about this:

  • This "better IDs" topic is so central and fundamental to a project such as GBIF that we HAVE to provide a solid solution.
  • At the same time, that's a VERY complex issue (at every level: code complexity/performance/workflows/interaction between providers/bif/secretariat/aggregators), where  the 100% perfect solution probably does not exists. So we'll have to recognize it, and decide what is "good enough" instead of spending 10 years thinking of an unimplementable, non realistic solution.
  • The more we wait, the harder it will be to integrate with existing data and infrastructure.
  • It's the right time to do it. It was probably not really possible (nor top priority) before the new processing workflow in the secretariat, and it will be mandatory to :
  • implement next steps (such as the annotation mechanism)
  • accomodate new data types and bigger volume of data
  • eventually increase quality to convince all the people and institutions that will need to convince (give a bright future to GBIF!)

We will need to observe what others have done, and to think out of the box:

  • We definitely have to keep DOIs in mind and "reinvent the wheel" only in cases where it proves necessary. Arriving later allows us to learn from their successes and their mistakes.
  • From a purely "programming" point of view, the inspiration to solve all this is probably to be found more in Git, SVN or a distributed filesystem than in traditional biodiversity informatics apps.

These are just random ideas. They may very well be complete nonsense. They may also be such simple common sense that it's not worth writing them down. But I'm glad this discussion is on the table.

Alberto González Talaván
2113 days ago

I assume that you are all aware that we are organizing a course on persistent identifiers in February 2012. We expect that rich discussions similar to this one will happen there. The deadline for nominations by GBIF Participants is tomorrow (Friday 9 December), so I hope that all of you interested in participating have already contacted your respective node manager or head of delegation!! :)

Peter Desmet
2113 days ago

@Dag: Cool! Glad to see that this can work.

@Nicolas: Summarizing is good. Thanks!

@Alberto: I didn't submit my nomination for the course, because 1) I think it's too soon to train people in the use of persistent identifiers, 2) I had the feeling it's focussed on LSIDs and 3) I won't have the time in February. But from your comment I gather that this will be as much a discussion about persistent identifiers as a training course. Curious to see the results and happy with the discussion above. Did you get a lot of nominations already?

Lutz Suhrbier
2112 days ago

Dear all,

Obviously my recent posting is not visible, so apologies if it appears twice now.

First, I would like to let you know that we are currently in the specification phase of a German DFG-funded project dealing with online annotations for biodiversity data (http://wiki.bgbm.org/annosys; the web site will have content beginning next year, sorry). We are mainly focused on the BioCASe protocol and thus the ABCD and Darwin Core specimen dataset formats.

The aim is to provide annotators with an interface enabling them to "edit" specimen datasets and/or add comments to any dataset element, in order to "propose" these modifications to the collection's curator. The curator in turn will have an interface supporting them in accepting, rejecting or further discussing annotations, and in taking over annotated values to produce a new revision of the dataset within their local data provider. Due to many technical restrictions, I guess taking these modifications over directly into the database will only be possible for specific, dedicated data providers. Further on, a message system is envisioned to keep annotators, curators and other interested subscribers informed about the current status of annotations (e.g. new annotations for a given species, curation status of an annotated specimen etc.).

Currently, I am working on a technical specification to realise these aims. I can confirm that persistent identifiers are a crucial point for linking datasets and annotations together. In our scenario, we even need to go one step further. We would appreciate a service that can archive specimen dataset documents and produce a GUID or UUID or whatever unique identifier, also recording the time when a document of a certain revision was last seen, and the data format in which that document was represented (e.g. ABCD, Darwin Core).

Using a hash function (MD5 or SHA-xxx) is a good idea in order to uniquely identify complete documents. But it does not respect the format of the document, and the data elements must be somehow normalised (which appears to be a huge amount of work, even if only considering ABCD and Darwin Core normalisation). So my proposal would be to combine the hash over a complete document with a species-identifying GUID, a timestamp and a format id determining the specimen dataset format standard used.
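As a very rough sketch of what such a combined identifier could carry (the names, layout and separator are only illustrative, not a specification):

```python
import hashlib
from datetime import datetime, timezone

def annotation_target_id(document_bytes, specimen_guid, format_id):
    """Combine a hash of the archived document with a GUID, a 'last seen'
    timestamp and a format id, roughly as proposed above."""
    doc_hash = hashlib.sha256(document_bytes).hexdigest()
    last_seen = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{specimen_guid}|{format_id}|{last_seen}|{doc_hash}"

abcd_document = b"<DataSets>...</DataSets>"  # the archived ABCD or DwC document
print(annotation_target_id(abcd_document, "urn:example:specimen:42", "ABCD-2.06"))
```

An annotation stored against such an identifier would then record exactly which revision of the document, in which format, it was made on.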

I am looking forward to further opportunities to discuss topics like these, in this forum and elsewhere.

Aaike De Wever
2112 days ago

Dear all, I am really impressed by the amount of discussion this topic has generated [I have been in a meeting since Thursday and only now found the time to go through the discussion in detail]. I guess it really shows it's a "hot" and important thing to consider for the process of data publication.

Coming back to one of the first reactions, I actually agree with David Remsen that it would be a very good idea to make the data provider aware of any potential issues with the data at the level of the publication itself, or in other words in the IPT tool.

Together with the annotation mechanisms discussed (which may indeed require some kind of versioning mechanism and stable IDs, as amply discussed), the peer review of datasets when they are published through a data paper seems a really important tool to ensure better data quality.

Dag Endresen
2109 days ago

+1 for Rod's blog post and commentary on the "Darwin Core Triplet", which he calls a "failure to learn from past mistakes": http://t.co/QZACPtmK

Why is everybody seemingly so afraid of assigning and starting to use persistent identifiers, when everybody seems to be in such harmony about the great need for them...?

An MD5 hash or similar, as discussed here, might be a useful approach to identify entities as a behind-the-scenes hack, in the absence of PIDs provided by the source data owner. However, as soon as an entity is identified, the MD5 hash might be less useful even as an internal identifier, and most likely even less useful as an identifier shared with other network partners.

Hannu Saarenmaa
2109 days ago

I think people are afraid to start using persistent identifiers because they would interfere with the tradition of relying on coden providers such as Index Herbariorum. This tradition is very long, and hundreds of millions of specimens already carry these identifiers.

We are assigning persistent identifiers in our digitisation process. Take a look at this specimen: http://morphbank.digitarium.fi/?id=3500032 . This image link is stable, and there is a link to a specimen database with more details, which is also stable.

We have noted that the tradition in the botanical community and the GPI, which requires the use of the Index Herbariorum code in the specimen barcode/URI, is causing friction with the use of these kinds of "content-free" identifiers. The DwC triple is favoured there. Such a triple could probably be embedded in a persistent identifier, but at the risk of breaking when institutions change. In order not to alienate users, we are currently trying to find a compromise between the content-free URI and the DwC-type URI. I.e., will it make sense to embed the DwC triple in a URI?

Rod Page
2108 days ago

Hannu,

 "I.e., will it make sense to embed the DwC triple in an URI?" 

Short answer: YES!

Long answer:

There need not be much friction: existing identifiers can be used as the basis of persistent identifiers (for example by simply adding a globally unique namespace as a prefix). Indeed, many DOIs for articles are created in this way: journals took their existing identifiers and added them to their DOI prefix.

IMHO "content-free" identifiers (I guess these are the same as "opaque" identifiers) are a huge red herring, and people stating that identifiers must be opaque need a good slapping. Opaque identifiers are not needed, indeed some would argue that "hackable" identifiers are a good thing. What matters is not that identifiers are opaque (i.e., we can't interpret the parts to mean anything), but that we don't assume that we can interpret them

For example, http://bit.ly is an identifier, and if I know something about Internet domains I can infer that ".ly" is the top-level domain for Libya. I can't, however, infer that the web site with the URL http://bit.ly is based in Libya.

So there's no good reason not to take, for example, Darwin Core triples, stick a globally unique prefix in front, and serve those. Over time the components of a Darwin Core triple-based identifier may well lose their meaning (the specimen may move to another place, people may forget what "MCZ" stands for, etc.), but that's fine, because we don't need that information to use the identifier.
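In code terms, that's all the minting amounts to; a minimal sketch (the namespace is obviously just a placeholder, and the triple values are made up):

```python
from urllib.parse import quote

PREFIX = "http://example.org/specimen"  # placeholder globally unique namespace

def specimen_uri(institution_code, collection_code, catalog_number):
    """Mint a URI by prefixing the existing Darwin Core triple.
    The parts are carried verbatim; consumers should not try to interpret them."""
    triple = "/".join(quote(part, safe="") for part in
                      (institution_code, collection_code, catalog_number))
    return f"{PREFIX}/{triple}"

print(specimen_uri("MCZ", "Herp", "R-12345"))
# -> http://example.org/specimen/MCZ/Herp/R-12345
```

If the institution later renames its collection, the URI above simply keeps resolving as an opaque string; nothing about its usefulness depends on the parts still being accurate.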

You say that http://morphbank.digitarium.fi/?id=3500032 is "persistent" - in what sense? It still has too much unnecessary technology-specific content for my liking. Why not rewrite it as http://morphbank.digitarium.fi/id/3500032 (that way we don't need to understand that "?id=" is a URL query parameter)?

Also, the assumption being made is that the domain morphbank.digitarium.fi will exist forever (or for as long as it is relevant). Obviously we can't guarantee this, but we could ask what would happen if the domain expired, the project lost funding, etc. What would then happen to URLs that start with morphbank.digitarium.fi?

One approach to this problem is the computer scientist's friend, indirection: have an identifier that can be pointed to a new place if needed. This is the basis for DOIs, as well as PURLs. I think as soon as we start thinking about persistence we pretty much end up at this point (and others have got here before us).

Hannu Saarenmaa
2108 days ago

Thanks for a clear answer, Rod. It makes me more confident about accommodating the DwC triple and IH coden in the URI.

Of course the longevity of morphbank.digitarium.fi can be questioned. If our project ends, I trust that the Finnish Museum of Natural History, or rather morphbank.net through one of its mirror sites, will swallow and serve our URIs and our data. Morphbank ID numbers are unique and have been allocated globally to the various mirrors. Beyond this, if the entire Morphbank project should terminate, a major natural history museum would probably take over the domain and data. These museums are built to last. Funny, though, the Finnish Museum just changed its domain name...

Rod Page
2108 days ago

I guess the key question is whether we can simply hope someone will look after it, and if we want that to happen, how we can design identifiers that make that process easier.

I recommend reading Martin Fenner's interview with Geoffrey Bilder (from CrossRef) http://blogs.nature.com/mfenner/2009/02/17/interview-with-geoffrey-bilder for some background on these issues.