Interest group about data quality, cleaning and fitness-for-use. TDWG data quality interest group.

Summary of DQIG meetings during TDWG2016, Costa Rica

Arthur Chapman
412 days ago

The Data Quality Interest Group had a very productive week before and during the TDWG2016 Symposium and Workshops in early December, firstly at the Hotel Arenal Manoa near La Fortuna de San Carlos, Costa Rica, and then at CTEC in Santa Clara de San Carlos, Costa Rica.

Sunday Meeting

A key group of DQ Team leaders and specialists met on the Sunday at the Hotel Manoa to identify issues and to discuss possible future directions.


Two symposia were run with 12 × 15-minute talks, one in the Plenary session on Monday and one on the Tuesday afternoon. Both were extremely well attended, and feedback was enthusiastic and positive. Papers can be seen on the TDWG website. We thank all the presenters for their efforts and for keeping to the time allotted them. These talks were very valuable and it was great to see so many presentations linking back to the Framework.

Working Group Meeting

On the Wednesday morning, we held a working group meeting that was attended by over thirty people. There was some robust discussion, valuable ideas were put forward, and quite a number of attendees volunteered to help with aspects of our 2016-2017 workplan.


Task Group 1: Framework for Data Quality (Action Items: Allan Koch Veiga)

Controlled Vocabulary

  • Need to improve the definitions
  • Establish good Biodiversity examples that, where possible, conform with other documents being prepared by the Interest Group.
  • Examine the literature to ensure consistency in definitions
  • Liaise with the Vocabulary Management Interest Group to ensure conformity with the documents being prepared by that group
  • Establish a process for collaboration and examine the best way to manage and share the document (NB check documents being produced by the Vocab Management IG).
  • ---- From the Vocab Management document, we are probably looking at SKOS for the controlled vocabularies (e.g. controlled vocabulary for DQ Dimension), and an RDF/(?OWL) representation of the framework itself, probably along with several human-readable normative and non-normative guidance documents. [Paul Morris]
  • Some suggested changes (not concrete yet and need to be circulated for comment)
  • ---- Distinctness rather than Uniqueness
  • ---- Multirecord rather than Dataset
  • ---- Amendment rather than Improvement
  • ---- Filtering versus Validation in Profiles (i.e. in the subset – still under discussion)

Extension of the Framework

  • Need to define the problems
  • Need to further discuss terminology in terms of the Framework
  • Map the duality of concepts from the point of view of Data Quality and from point of view of Data Problems/Errors.
  • Vocabulary for reporting error cases


  • Need to develop a common metadata schema
  • Create a computational platform

Training and Outreach

  • Need to develop outputs and training at different levels to convey the information about the Framework – especially to users. (Dmitry mentioned that the Profile may be scary to many Academic users and that we should market the profiles, not the details.)
  • Need a one page document on the Framework

Additional Items

  • Refine the Tool for creating Data Quality Profiles from Use Cases.
  • Refine and foster use of Conceptual Framework in context of TDWG/GBIF community and the broader research community (User needs survey/Use Cases)
  • Need a name for Framework (e.g. F4U Framework, FIT4U Framework)

Task Group 2: Tools, Services and Workflows (Action Items: Lee Belbin)

The Spreadsheet with Current List of Tests is here

Name of TG2

Recommended to rename the Group to “Tests and Assertions”. Following the meetings, Arthur emailed the TAG and received a response on the process: "A Task Group charter can be revised at any time and submitted to the Secretary as a draft. A draft revision may reflect minor corrections or clarification, or may reflect a change in operation or purpose. The Secretary may approve minor changes, but must distribute any substantive revision to the Executive Committee for review and approval before it replaces a prior charter." (http://www.tdwg.org/about-tdwg/process/). We need to amend the current charter to reflect the agreed focus on tests and assertions. Paul Morris suggested that the ‘Services’ component of the original charter is better placed with the TDWG Biodiversity Services and Clients IG. (Lee Belbin/Arthur Chapman)

Follow up from the TDWG meetings

  • Follow up on feedback from meeting and subsequent emails
  • Develop Use Case to define “core” terms
  • Assign tasks to actors as identified in DQ spreadsheet at TDWG workshop (see https://docs.google.com/spreadsheets/d/1td7zJ9GH3WWhu0Pa1X-1fkaWk71U8qqr54-kkbfwbfE/edit#gid=1168468409).
  • Review Descriptions for structural and term consistency, conciseness and readability (Shelley James)
  • Review Principles ensuring consistency and ordered from broad to narrow. These will form the sections/paragraphs of a paper and presentations at TDWG (Lee Belbin)
  • Review tests and add a column for "ease of implementation". Maybe a 1-5 scale? I expect this will reflect complexity as we move from "term" to "external" but it would be wise to classify then reflect as this information may suggest at least a two-level implementation - core and extended? (Alex Thompson)
  • Develop a test dataset in Darwin Core Archive format that will exercise all tests. Start with real data and generate artificial variants. Both types of data will be available. (Alex Thompson, Paul Morris, Lee Belbin)
  • Write stand-alone code that implements the tests against the test dataset. Ideally, the code should be as simple as possible and documented so that it can be picked up in a modular fashion and implemented broadly. Kurator has some sample data sets, and will build more. See, for example, the example data sets listed at http://wiki.datakurator.org/wiki/FP-Akka_User_Documentation#Or:_Small_example_data_sets and https://github.com/kurator-org/kurator-samples. A body of Java Code that implements a set of validations and amendments on Event terms developed for exploring expressing the TG2 tests in terms of F4UF can be found at: https://github.com/FilteredPush/event_date_qc [Paul Morris]. (Paul Morris and John Wieczorek to lead). Abby suggested having an analysis following the cleaning (before and after). Alex and iDigBio well advanced in writing code. Action Item: Alex Thompson to liaise with ALA/GBIF and others.
  • Classify Darwin Core Terms into classes that reflect testability, e.g., verbatim, vocabulary, range checkable (Alex Thompson)
  • John raised the option of 2-3 standard tests applicable to each Darwin Core Term. Using #15, can we write a concise high-level logic? If we had tables like the tests spreadsheet, but at an atomic level (DwC Term "Type", lookup values such as vocabs or range limits), we could generate test code. For example, if the Darwin Core Term is decimalLatitude and is of type "has range bounds" and the corresponding range bounds entry is >= -90 and <= 90, then we could generate the test code. BUT, that is only for tests of type "Term" and "External", not "Multiterm" (where permutations of combinations could grow complex) (Alex Thompson)
  • Write paper, post feedback to TDWG community (Lee Belbin to outline)
  • Recommendation on implications from tests for Darwin Core, e.g., "It would be great to maintain a community vocabulary for unambiguous lookups (e.g., https://github.com/tucotuco/DwCVocabs/blob/master/vocabs/basisOfRecord.txt)" (John Wieczorek)
  • Establish GUIDs for Tests/Assertions/Vocabularies (See Principle #6) (Alex Thompson)
  • Review how best to handle scale (coastlines for terrestrial/marine, etc.) and buffering for Centre of Country (Arthur Chapman +)
  • Refine terminology following discussions at TDWG (All)
  • Split DwC terms into two columns: “Target” and “Contributing” (Paul Morris). Circulate to the core group for feedback.
  • Review Tests and Assertions (All to review; Paul Morris to review but not on Google Docs)
  • Review References for additions and deletions (All to review; NB only key references)
  • Review criteria for inclusion of Tools for additions and deletions (All to review)
  • Circulate refined document to as broad an audience as possible (Once we have a reasonably stable spreadsheet - the ‘straw man’) (Lee Belbin)
  • Possibly use a tool such as Mendeley and/or Zotero: list of authors and datasets based on GBIF data, that might be approached to look at the tests (Dmitry Schigel to extract)
  • Base on GitHub and generate HTML readable documents. (Alex Thompson to lead the putting up of the spreadsheets for public access and invite feedback)
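
The table-driven idea John raised in the list above (a Darwin Core term, a type such as "has range bounds", and a lookup entry holding the bounds) can be illustrated with a short Python sketch. This is not the TG2 implementation: the table layout, the names `TERM_TABLE`, `make_range_test` and `build_tests`, and the result labels are all assumptions made for illustration. Only the decimalLatitude bounds come from the discussion; the decimalLongitude row is added as a second example.

```python
# Hypothetical atomic table: Darwin Core term -> type and lookup values.
# Only the decimalLatitude row comes from the meeting notes; the
# decimalLongitude row and the result labels are illustrative assumptions.
TERM_TABLE = {
    "decimalLatitude": {"type": "has range bounds", "bounds": (-90.0, 90.0)},
    "decimalLongitude": {"type": "has range bounds", "bounds": (-180.0, 180.0)},
}


def make_range_test(term, low, high):
    """Return a test function for one term with inclusive range bounds."""

    def test(record):
        raw = record.get(term)
        # An absent or empty value is not tested rather than failed.
        if raw is None or str(raw).strip() == "":
            return "NOT_TESTED"
        try:
            value = float(raw)
        except (TypeError, ValueError):
            # A non-numeric value cannot satisfy a numeric range.
            return "FAIL"
        return "PASS" if low <= value <= high else "FAIL"

    return test


def build_tests(table):
    """Generate one test per term of type 'has range bounds'."""
    tests = {}
    for term, spec in table.items():
        if spec["type"] == "has range bounds":
            low, high = spec["bounds"]
            tests[term] = make_range_test(term, low, high)
    return tests


TESTS = build_tests(TERM_TABLE)
```

A record is passed in as a plain dictionary keyed by Darwin Core term, for example `TESTS["decimalLatitude"]({"decimalLatitude": "45.5"})`. As noted in the discussion, this approach only covers single-term tests; "Multiterm" tests would need more than a one-row lookup.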

There needs to be a Machine Readable way to link to the tools.

  • Anton Güntsch mentioned a Web services location where Tools could be placed? Action Item: Follow up with Anton. Paul noted that this is the BSCI Service Registry: http://www.biodiversitycatalogue.org
  • Alex showed a slide that mentioned this the following day. Action Item: Alex Thompson to expand on this?

Task Group 3: Use Case Library (Action Items: Miles Nicholls)

User Stories -> Use Cases -> Profiles

  • Currently two entry points
  • ---- User stories (Google Form)
  • ---- Spreadsheets (More complicated)
  • Need a tool to refine collecting and defining Use Cases
  • Need to engage with Specific Communities to elucidate Use Case Stories (Action item: Shelley James for iDigBio community).
  • Use Case stories in Natural Language and then need translation to spreadsheet
  • Need to visit terminologies to make sure consistent across the board (with TG1 and TG2)
  • Suggested that there will be a lot of overlap and convergence between use cases, so that another step is probably needed – i.e. to find commonalities and classify Use Cases and then these would lead to the Profiles, rather than have lots of similar Use Cases. Advantages to users/publishers would include:
  • ---- Adherence by users as a way of compliance with a well-defined process, leading to criteria of quality
  • ---- Publishers to provide tools for users to get their good data
  • ---- Papers could point to a standard methodology and profile, and that would also indicate quality User cases – user stories

Vocabularies (Action Items: Paula Zermoglio)

There was discussion on how best to handle vocabularies, look-up tables, etc. – especially those linked from the Tests and Assertions. Paula Zermoglio offered to lead this process. It was decided that a Task Group was not appropriate under the DQIG at this stage, and that it may instead be suggested to the Specimens and Observations Interest Group that a Task Group be formed under that Interest Group. We will need to continue to monitor this.

  • Discuss Community Vocabularies, and explore the possibility of a standard repository (e.g. "It would be great to maintain a community vocabulary for unambiguous lookups (e.g., https://github.com/tucotuco/DwCVocabs/blob/master/vocabs/basisOfRecord.txt)")
  • Develop GUIDs for vocabularies
  • Need to liaise with Vocabularies Management Interest Group. Several of us attended their meeting.
  • Vocabulary for reporting error cases and for reporting pre-enhancement/post enhancement status (Action Item: Allan Koch)
  • Propose Task Group under the Specimens and Observations Group (Action Item: Who…?)
  • DwC – Necessary to give clear definitions of terms to specify tests
  • Action Item: John Wieczorek in his spare time to examine required lookup tables
  • There needs to be Community support to maintain and develop vocabulary: TDWG to support. NB link to Vocabularies Management Interest Group.
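
As a rough illustration of what an "unambiguous lookup" against a community vocabulary might look like, here is a minimal Python sketch for basisOfRecord. The lookup table below is invented for illustration and is not taken from the DwCVocabs file linked above; the function name `lookup_basis_of_record` and the normalisation rule (trim, lower-case, collapse whitespace) are likewise assumptions. The target values are Darwin Core's recommended basisOfRecord values.

```python
# Hypothetical lookup table: normalised verbatim value -> standard value.
# The verbatim variants are invented examples, not the community vocabulary.
LOOKUP = {
    "preservedspecimen": "PreservedSpecimen",
    "preserved specimen": "PreservedSpecimen",
    "fossilspecimen": "FossilSpecimen",
    "fossil specimen": "FossilSpecimen",
    "humanobservation": "HumanObservation",
    "human observation": "HumanObservation",
}


def lookup_basis_of_record(verbatim):
    """Return the standard value for a verbatim string, or None if unknown."""
    if verbatim is None:
        return None
    # Normalise: trim, lower-case, and collapse internal whitespace.
    key = " ".join(verbatim.strip().lower().split())
    # Each key maps to exactly one standard value, so a hit is unambiguous.
    return LOOKUP.get(key)
```

Returning `None` for an unrecognised value, rather than guessing, leaves it to the calling test to report the record as failing or as needing a vocabulary addition.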

Other Items

  • Liaison with Annotations Interest Group (Paul is Convenor). Several of us attended the Annotations Interest Group meeting, and it was suggested we attempt a joint meeting sometime – possibly in conjunction with the SPNHC meeting in Denver in June.
  • Establish BDQ GitHub – Allan has agreed to help Arthur to set this up. Action Item: Arthur Chapman and Allan Koch to establish structure
  • Expansion and implementation
  • Development of Training materials
  • ---- Audience(s)? [Collections, data managers, researchers - graduate and professional level, General Users] Liaise with groups working on curriculum (Hanna, Town Peterson, etc.)
  • ---- Incorporate into Data Carpentry
  • Implementation by Data Publishers
  • ---- Send Data Publishers information about the Framework etc. and have them buy-in
  • ---- Talk to Pensoft about incorporating DQ aspects into data published /papers
  • Getting buy-in
  • ---- Advertising the work of the IG to different audiences
  • Tying assertions to data on download
  • ---- Need recommendation that assertions are downloaded by default and that users need to actively turn it off if they wish
  • ---- Need training to explain the implications
  • Feedback Mechanisms (Action Item: Paul Morris to lead)
  • ---- List of feedback mechanisms (Format, Tools, Annotations, etc.)
  • ---- Scope of what is fed back?
  • ---- Paul Morris has since reported that he has raised this issue as an Issue in the Annotation Interest Group GitHub (https://github.com/tdwg/annotations/issues/4)
  • Expansion for DwC extensions
  • ---- Something that may be considered later but not a priority now (Extensions might be Use Cases themselves)
  • ---- Paul suggested that there are a number of areas where we need to be explicit - especially tests related to dates.
  • Paul Morris reported that “we did look briefly at the proof of concept representation of a subset of TG2 tests in terms of F4U Framework from the Kurator project (https://github.com/kurator-org/kurator-ffdq), done in large part by David Lowery working with code by Allan and Paul.”


It was agreed that we may need to consider two standards by TDWG2018

  • Definitions
  • Tests and Assertions

TDWG has several types of standards

  • Technical specifications
  • Applicability statement
  • Best current practices
  • Data standards
  • Valid values and controlled vocabularies

Meeting between TDWGs

  • Need to pursue funding
  • TDWG will send out a Call for Proposals for its Community Support Fund early in the New Year.
  • Partners
  • ---- TDWG
  • ---- GBIF
  • ---- ALA
  • ---- iDigBio
  • ---- Kurator – John
  • Agenda – to be determined
  • Location – possibly in the Pacific in conjunction with the GBIF BID Project. We also hope to have a joint meeting with the Annotations Interest Group (SPNHC, Denver in June 2017) and plan a pre-TDWG meeting in Ottawa (Sunday 1st October 2017).

Future Special Issue Publication

Rob Stevenson suggested that we need a paper with an example to showcase the Framework. Following the meeting, Antonio and Arthur discussed the possibility of a Special Issue publication highlighting the work of the Interest Group and the Framework and its use in various projects. On discussing this with a few members of the Group, we received a positive response. Possible topics include:


  • An overarching article about the process
  • Citizen science
  • Agrobiodiversity
  • Species Distribution Modeling (SDM)
  • Invasive Alien Species (IAS)
  • iDigBio
  • Atlas of Living Australia (ALA)
  • Taxonomic
  • Genetic
  • Marine
  • GBIF
  • Trust and risk assessment
  • Mycology
  • Interactions
  • Paleobiology challenges - integrating neo and paleo data implementing DQ
  • Others arising out of Use Cases