
GBIF Knowledge Organization Systems White Paper Comments

This current community page is also accessible as http://bit.ly/GBIFKOS_Comments.

The current GBIFKOS Draft White paper is also accessible as 
http://bit.ly/GBIFKOS_2010-11-25-0400 

The GBIFKOS White Paper is a set of recommendations to GBIF for the deployment and community development of Knowledge Organization Systems (e.g. Controlled Vocabularies, Ontologies, Thesauri, etc., and their management) for biodiversity information systems.

You can fetch a copy of the draft White Paper HERE. Instructions for commenting on it are below.

Your comments are meant to inform the White Paper authors and may or may not be incorporated or addressed in the final draft. However, all will be available to GBIF and become part of the public record. In email comments you can indicate whether you wish that your comments not become public.

Regrettably, the time for initial public comment is very short. Anything not received by December 7, 2010 will not influence the final draft of the White Paper, though it will remain available for GBIF to consider as it acts on the White Paper recommendations.

There are several ways you can comment:

  1. Add comments to this page. Please be as explicit as possible, especially as to what part of the White Paper your comment addresses. One comment per issue is helpful, though not required.
  2. Join the KOS mailing list kos@gbif.org by subscribing at the kos list info and comment on the list after your subscription is activated. Here too, one issue per post is most helpful.
  3. Annotate a copy of the White Paper as a Word or OpenOffice document and upload it here with a brief comment that you have done so, preferably also tagging it with "kos". Absent such notice, we may not realize you uploaded something. Alternatively, if you have subscribed to kos@gbif.org, you may attach it to a post to that list.

We appreciate your comments and look forward to considering them for our advice to GBIF.

Bob Morris, for the GBIFKOS White Paper authors.


Last updated 1335 days ago by Bob Morris


Thank you for the report - very interesting.

I notice, however, that under the heading "Current State of Biodiversity KOS" you haven't mentioned the considerable work being done by the FAO - especially with respect to their AGROVOC Thesaurus (see for example: http://aims.fao.org/website/Ontology-relationships/sub) and their KOS Registry (see http://aims.fao.org/en/website/KOS-Registry/sub).  You do mention the Bioversity International work on Crop Wild Relatives, and much of this was done in conjunction with the FAO as part of a UNEP-GEF project.

Also in the Tool Development area, you may wish to look at the AGROVOC Concept Server Workbench (http://aims.fao.org/website/AGROVOC-Workbench/sub2) which they cite as a "Tool that shall help to build and structure multilingual ontologies and terminology systems in a distributed and collaborative environment."

Also in the Linking area, you may wish to look at the NeOn project (http://aims.fao.org/website/NeON/sub2), a 14.7 million project. To quote: "The aim of NeOn is to advance the state of the art in using ontologies for large-scale semantic applications in distributed organizations; and to create the first ever service-oriented, open infrastructure, and associated methodology, to support the development life-cycle of this new generation of semantic applications with economically viable solutions."

I am not sure how any of these may fit into the GBIF projects, but they should at least be cited.

Hope this helps

Arthur D. Chapman

Bob Morris 1334 days ago

The content of your report does not seem to match the scope specified in GBIF's request for proposals.

First, you've included suggestions that do not follow from the report guidelines.  If I understand the RFP correctly, it's asking you to address the problem of effective production and use of resources (vocabularies, ontologies, etc. - I am not sure what 'KOS' means so I will say 'resources') *in general* so that we get insight that can be applied across the board.  You have a number of items that relate to particular resources, and I think these are out of place.  In particular I'm referring to the section 'GBIF participation in KOS standards development' items b, c, d, e, f, g.  While these are worthy efforts, they all face common problems.  It is the common problems that should be the subject of the report and its recommendations. Forming a bunch of separate incubation groups is not necessarily going to help achieve the kind of economies hinted at in the call.

Second, GBIF has asked for recommendations on governance, multi-lingual vocabularies, persistent identifiers, and dealing with heterogeneity of "modeling" approaches, but you don't touch on any of these.  These topics are difficult and important.  There is much to say, and much to be figured out.

In fact many of the areas listed in the bullets under 'guidelines' in the call are not discussed at all in your report.  For best service to GBIF and its community I would ask that you go over these and make sure that all topics are addressed, if only to say that more work is needed.  Here's the location again:

http://www.gbif.org/communications/news-and-events/showsingle/article/request-for-proposals-for-a-position-paper-on-vocabularies/

Best
Jonathan Rees

Jonathan Rees 1324 days ago

Could you provide a definition of KOS?  I'm a bit confused as to whether, say, DwC and ABCD are KOSes, or for that matter RDFS, OWL, XML Schema, or SKOS. Before reading the GBIF call I thought only the latter set were KOSes, but now I wonder if only the former set is. Your report seems to apply it in both senses.

I'm not sure how a set of definitions or a thesaurus would qualify as "knowledge", since definitions aren't falsifiable. If you want simple vocabularies or fiat classifications to be KOSes I suggest you define KOS as a term of art.

Personally I would not say KOSes are a "discipline", but I'm sort of ornery that way. A little evidence to back up this characterization would be helpful.

Definitions of as many other terms as possible would be nice, e.g. "information", "semantic", "concept map".

Best
Jonathan

Jonathan Rees 1324 days ago

Saying that 'linked data' is "a somewhat ill-defined concept" seems out of place in this report (see previous comment). Besides, compared to many terms in this arena I think 'linked data' is exceptionally well-defined - the idea is laid out simply here: http://www.w3.org/DesignIssues/LinkedData and this seems to be the way the term is used in practice. I recommend you delete this characterization, and refer the reader to TBL's design note for a definition. The approach may be open to criticism but I don't think ill-definedness is among its problems.

Best
Jonathan

Jonathan Rees 1324 days ago

The bar graph in the 'KOS familiarity' section is intriguing but I found it hard to match the bars up with the text ("familiar or very familiar") and I found the key to be confusing. One or more what?

Best
Jonathan

Jonathan Rees 1324 days ago

I thought that it might be useful to provide some initial comments on the GBIF KOS Report.


There are several issues but I will mention only a few in this email.


The first is "There appear to be no systematic attempts to develop use cases, competency questions,  or other goals for use of KOS in the biodiversity informatics community."


What about these resources and efforts that have been going on for several years?


http://about.geospecies.org/



http://about.geospecies.org/sparql.xhtml   http://www.taxonconcept.org/example-sparql-queries/



http://www.taxonconcept.org/



Note that this seems to be the only open SPARQL endpoint that is devoted to biodiversity informatics.


http://www.taxonconcept.org/sparql-endpoint/



It is also the SPARQL endpoint for a number of the data sets that are mentioned.


It also has the only examples which use the "IETF scheme for URIs for geographic locations" mentioned in the report.
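
For readers less familiar with SPARQL endpoints like the one mentioned above, the following is a minimal sketch of how a client might form a query request. It relies only on the SPARQL protocol convention of passing the query text in a `query` parameter of an HTTP GET. The endpoint address `http://example.org/sparql` is a placeholder and not the real service address, and the query itself is only illustrative.

```python
from urllib.parse import urlencode

# Illustrative query: list a few skos:closeMatch links held by the endpoint.
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?s ?match WHERE { ?s skos:closeMatch ?match } LIMIT 10
""".strip()


def build_sparql_get_url(endpoint, query):
    """Return the GET URL that carries `query` in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder endpoint (an assumption); substitute the real service address.
url = build_sparql_get_url("http://example.org/sparql", QUERY)
```

The resulting URL can be fetched with any HTTP client; nothing is sent over the network by the sketch itself.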

Also this: "there appears to be no semantically enabled discovery of these resources.  Work across subdisciplines is hampered by this, as scientists haphazardly locate resources which may or may not be the most fit for their purpose. For example, a field biologist made aware of ITIS might never become aware of its relationship to the Catalog of Life."


This RDF snippet is from this record ( http://lod.geospecies.org/ses/v6n7p.html ), one of several thousand that have been around for years.



By querying one of the various LOD services, a human or machine would find this interlinking.



    <skos:closeMatch rdf:resource="urn:lsid:ubio.org:namebank:105509"/>
    <skos:closeMatch rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>
    <skos:closeMatch rdf:resource="http://www.uniprot.org/taxonomy/9696"/>
    <skos:closeMatch rdf:resource="http://bio2rdf.org/taxon:9696"/>
    <rdfs:seeAlso rdf:resource="http://bio2rdf.org/taxon:9696"/>
    <skos:closeMatch rdf:resource="http://dbpedia.org/resource/Cougar"/>
    <rdfs:seeAlso rdf:resource="http://dbpedia.org/resource/Cougar"/>
    <skos:closeMatch rdf:resource="http://rdf.freebase.com/ns/guid.9202a8c04000641f800000000008b5a3"/>
    <skos:closeMatch rdf:resource="http://sw.opencyc.org/concept/Mx4rvVj5o5wpEbGdrcN5Y29ycA"/>
    <skos:closeMatch rdf:resource="http://www.bbc.co.uk/nature/species/Cougar#species"/>
    <rdfs:seeAlso rdf:resource="http://www.bbc.co.uk/nature/species/Cougar.rdf"/>
    <geospecies:hasGBIF>13815711</geospecies:hasGBIF>
    <geospecies:hasGBIFPage rdf:resource="http://data.gbif.org/species/13815711"/>
    <foaf:page rdf:resource="http://data.gbif.org/species/13815711"/>
    <geospecies:hasITIS>552479</geospecies:hasITIS>
    <foaf:page rdf:resource="http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&amp;search_value=552479"/>
    <geospecies:hasNCBI>9696</geospecies:hasNCBI>
    <foaf:page rdf:resource="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9696&amp;lvl=0"/>
    <geospecies:hasBioLib>id1995</geospecies:hasBioLib>
    <geospecies:hasBioLibPage rdf:resource="http://www.biolib.cz/en/taxon/id1995"/>
    <foaf:page rdf:resource="http://www.biolib.cz/en/taxon/id1995"/>
    <geospecies:hasBBCPage rdf:resource="http://www.bbc.co.uk/nature/species/Cougar"/>
    <foaf:page rdf:resource="http://www.bbc.co.uk/nature/species/Cougar"/>
    <geospecies:hasGNI>505310</geospecies:hasGNI>
    <geospecies:hasGNIPage rdf:resource="http://globalnames.org/?search_term=id:505310"/>
    <geospecies:hasWikipediaArticle rdf:resource="http://en.wikipedia.org/wiki/Cougar"/>
    <foaf:page rdf:resource="http://en.wikipedia.org/wiki/Cougar"/>
    <geospecies:hasWikispeciesArticle rdf:resource="http://species.wikimedia.org/wiki/Puma_concolor"/>
    <foaf:page rdf:resource="http://species.wikimedia.org/wiki/Puma_concolor"/>
    <geospecies:hasToLPage rdf:resource="http://tolweb.org/Puma_concolor"/>
    <foaf:page rdf:resource="http://tolweb.org/Puma_concolor"/>
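
As a rough illustration of how a machine could pick up the interlinking in the snippet above, here is a minimal Python sketch that extracts the skos:closeMatch targets. It copies three lines from the snippet and wraps them in an rdf:RDF root so that they parse on their own. The wrapper and the rdf:about value (derived from the record's HTML address) are assumptions added for the example, not part of the original record.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Three closeMatch lines from the snippet. The rdf:RDF wrapper and the
# rdf:about subject are assumptions made so the fragment is well-formed.
FRAGMENT = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <rdf:Description rdf:about="http://lod.geospecies.org/ses/v6n7p">
    <skos:closeMatch rdf:resource="urn:lsid:ubio.org:namebank:105509"/>
    <skos:closeMatch rdf:resource="http://www.uniprot.org/taxonomy/9696"/>
    <skos:closeMatch rdf:resource="http://dbpedia.org/resource/Cougar"/>
  </rdf:Description>
</rdf:RDF>
"""


def close_matches(rdf_xml):
    """Map each described subject to the list of its skos:closeMatch targets."""
    root = ET.fromstring(rdf_xml)
    links = {}
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        links[subject] = [
            el.get(f"{{{RDF}}}resource")
            for el in desc.findall(f"{{{SKOS}}}closeMatch")
        ]
    return links


links = close_matches(FRAGMENT)
```

A full RDF library would be the more robust choice for real data; plain XML parsing is used here only to keep the sketch dependency-free.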



I would also like to address this statement: "For example, at this writing, LOD statistics reveal only 42 bioscience datasets holding 2.7B triples"

The Linked Open Data sets listed are only those LOD data sets that are documented here: http://www.ckan.net/

The biodiversity-tagged sets are here: http://ckan.net/tag/biodiversity

The Bio2RDF data set is over 15 billion triples on its own: http://www.slideshare.net/micheldumontier/bio2rdf-and-beyond


The authors of the report don't seem to be aware of the significance of the Linked Data movement.



Facebook's OpenGraph is also Linked Data, so all those "liked" pages are linked data.


Here is an interesting recent blog post from O'Reilly Radar.


The Linked Data Community is much larger and more significant than the GBIF report has implied.

Also you should look at what resources show up when you query "Quercus alba" in the following LOD services.


* Note the txn site is down today. I am reloading the data sets on this endpoint Monday December 6, 2010.

It would seem that this would have been a little hard to miss?


I was originally somewhat skeptical about Gregor Hagedorn's earlier statement.

"I fully believe you and all who are doing it do it with careful
consideration of the needs as they see it. I just believe that those
taking these decisions have a specific perspective and use case
scenarios, that involves biologists only after the perfect software
user interface system is finished. I challenge the last assumption ...
"

But after reading through the GBIF report, I think he makes a good point.

Reasoning will only work if everyone has a common conceptualization of what each of these things is, and how they relate to other things.

What is a species?

What is a species' relationship to a particular classification hierarchy? Can there be more than one hierarchy?

The report authors have experience in creating highly engineered systems where each entity is modelled in a formal strict way.

It is not clear to me that we have agreement on some of the fundamental entities, let alone how they relate to each other.

In my own work I have been thinking that the model of a species for occurrences etc. might not be the best model for addressing phylogenetic questions.

This is because you might want to have one standard agreed on classification so you can search for species in a given family that are potential pathogen vectors.

Others might want a different kind of entity that is not tied to one particular classification so they can address phylogenetic questions.

What you don't want is to prevent people from asking questions that relate to subfamilies etc. because the model does not support them.

It might be best to have separate, but loosely linked models for these different kinds of "perspectives"

This is similar to the way I model relationships between various LOD species entities.

For some uses you could interpret 


For other uses, they are not the same thing. (The latter is in a nested set of NCBI classification subclasses)

Because of this, I tie them together loosely with a skos:closeMatch.

This keeps them "findable" without entailing the other entity's potentially incompatible conceptualization.

This allows the end user of the data to determine if they want to convert these to an owl:sameAs relationship or not.
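
The choice left to the data consumer can be sketched as follows. This is only an illustration under stated assumptions: triples are modelled as plain Python tuples, the subject `http://example.org/taxon/puma` is a hypothetical identifier, and the "trusted prefix" rule is an invented policy standing in for whatever criterion a given user would apply.

```python
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical subject; the two targets appear in the RDF snippet earlier.
EXAMPLE_TRIPLES = [
    ("http://example.org/taxon/puma", SKOS_CLOSE_MATCH,
     "http://dbpedia.org/resource/Cougar"),
    ("http://example.org/taxon/puma", SKOS_CLOSE_MATCH,
     "http://bio2rdf.org/taxon:9696"),
]


def promote_close_matches(triples, trusted_prefixes):
    """Return a new list in which closeMatch links whose target starts with
    a trusted prefix are rewritten as owl:sameAs; all others are unchanged."""
    prefixes = tuple(trusted_prefixes)
    promoted = []
    for subject, predicate, obj in triples:
        if predicate == SKOS_CLOSE_MATCH and obj.startswith(prefixes):
            promoted.append((subject, OWL_SAME_AS, obj))
        else:
            promoted.append((subject, predicate, obj))
    return promoted


# A consumer who accepts DBpedia's conceptualization but not the others:
result = promote_close_matches(EXAMPLE_TRIPLES, ["http://dbpedia.org/"])
```

The published data stays loosely linked; only the consumer's local copy gains the stronger owl:sameAs statements.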

There are a number of other related efforts that don't seem to be mentioned here. Two that come to mind are NatureServe and eBird. The set of approximately 100 people seems very small and I suspect that it does not capture a representative sample of those working with biodiversity data. 

Respectfully,

Pete DeVries 1324 days ago

Being a member of W3C has costs and benefits, as does being a participant in HCLS. If you are going to recommend participation I think you should make a better case and acknowledge the downside.  W3C is a financial commitment, and HCLS is a time commitment. Maybe you can point to accomplishments of HCLS relevant to the GBIF community that would argue that participation is a good idea, or W3C working groups whose recommendations are relevant.

Best
Jonathan

Jonathan Rees 1324 days ago

You touch on the issue of ease of KOS (resource) creation vs. ease of use. I think you should bring this distinction to the fore.  As these affect different (but overlapping) communities it would be worth analyzing which community is experiencing the most pain, since different interventions will benefit the two communities differentially.

Related to this is the issue of feedback from resource users to developers.  Some of the OBO ontologies have paid a lot of attention to this, with well articulated mechanism for term submission and review. A survey of practices in and out of the GBIF sphere would be helpful and I think is along the lines of what GBIF asked for.

And related to this (sorry for free associating) is the issue of resources with infrequent major revisions, vs. frequent and fine-grained (term level) update.  We see both of these forms happening in practice, and it might be helpful to list the pros and cons of each.  (In fact the best run projects seem to do both, with an ongoing frequently updated "current" together with periodic snapshots that are given specific version numbers.)
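
The "current plus snapshots" pattern mentioned in the parenthesis can be sketched in a few lines. The class and method names below are hypothetical and do not describe any particular project's tooling; the sketch only shows term-level updates to a live copy alongside frozen, numbered releases.

```python
import copy


class VersionedVocabulary:
    """Hypothetical sketch: a frequently updated 'current' vocabulary
    together with periodic snapshots that carry version labels."""

    def __init__(self):
        self.current = {}    # term -> definition, updated term by term
        self.snapshots = {}  # version label -> frozen copy of the terms

    def update_term(self, term, definition):
        """Fine-grained (term-level) update of the live vocabulary."""
        self.current[term] = definition

    def release(self, version):
        """Freeze the live vocabulary under a specific version label."""
        if version in self.snapshots:
            raise ValueError(f"version {version!r} already released")
        self.snapshots[version] = copy.deepcopy(self.current)

    def lookup(self, term, version=None):
        """Look a term up in a given snapshot, or in 'current' by default."""
        source = self.current if version is None else self.snapshots[version]
        return source.get(term)


vocab = VersionedVocabulary()
vocab.update_term("habitat", "first draft definition")
vocab.release("1.0")
vocab.update_term("habitat", "revised definition")
```

Users who need stability cite a snapshot; users who need the latest terms read "current".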

Best
Jonathan Rees

Jonathan Rees 1324 days ago

The recommendation to invest in Bioportal does not seem to be well motivated.  It is offered as a "resource repository and directory and as part of the life cycle management of ontologies".  I am skeptical that Bioportal would handle the discovery example given ('general gaps' i.).  Assuming that ontology life cycle tools that are easy for domain scientists to use (ii.) is a sensible goal - and I'm not sure that it is - you would need to explain how Bioportal addresses this.  It doesn't match any of the other gaps listed.  So what would it be for?

By way of explaining Bioportal's value proposition, seven features are listed.  Most of these are not relevant to the GBIF community, much less related to identified needs (your gap analysis) or the report guidelines in the CFP.  Mentioning benefits that are not related to recognized needs makes the entire recommendation suspect.

Perhaps setting up Bioportal, and populating it with known resources, would help with resource discovery.  But this process would need to be spelled out in more detail for the case to be made.  How much effort is it to set it up?  How will it be populated and kept up to date?  It may be free, but is that "free as in puppy"? How many user visits would be expected?  Would community benefit outweigh costs?

A report on exactly how MMI uses Bioportal, what investment they've made in it, and what benefit they get from it would be helpful.

The authors of the report seem to favor OWL and, by implication, linked data (http: URIs as vocabulary terms).  Bioportal should therefore be examined for its support for OWL.  For example, does it display logical axioms, or follow links?

To make the case better, you also need to enumerate what alternatives you considered as solutions to whatever problems you think Bioportal would solve - including doing nothing, as well as the ones listed in the CFP.  We know that many current ontology efforts seem to be doing fine without Bioportal, and choose not to use it.  What are they missing?

I have to admit I am probably prejudiced as my attempts to use Bioportal in my own work have never turned out well. I'm not saying "don't recommend Bioportal", but rather "make it more convincing and focussed".

Best
Jonathan

Jonathan Rees 1324 days ago

"Its structure makes it less flexible for new applications than DwC" - could you explain this for those of us not familiar with ABCD?  Do you mean the fact that it's a schema means it's hard to build on it and combine it with other things?

Jonathan Rees 1324 days ago

"the notion of 'scientific observations' is gaining traction as a useful data modeling abstraction" -- could you support this claim (e.g. with a citation)?  Thanks

Jonathan Rees 1324 days ago

EOL is certainly interested in working with TDWG and GBIF and ViBRANT and others on vocabulary development and management, particularly with respect to SPM and how it interfaces with other kinds of data of interest to the biodiversity community. I finally had a chance to read through the draft white paper. A comprehensive review of the complex landscape is daunting, and I applaud the authoring group for its collective efforts. However, I admit that I found the organization (full of laundry lists) frustrating and some of the recommendations premature. My notes are available on the document which I will try to attach somewhere.  

A more general issue is what audiences GBIF (and/or TDWG) would intend to serve: Knowledge systems can involve both developers as well as domain experts who build vocabularies, and also domain experts who use them (perhaps without knowing it) to achieve their analysis goals. I'm afraid these audience distinctions don't come clearly through in this document.

If there were some tabular way to summarize the feature sets and pros and cons of all the vocabularies, tools, and systems that would be a huge step forward.

Cyndy Parr 1323 days ago

The following is posted with his permission, from email of Garry.Jolley-Rogers, CSIRO

 


Overall comments.

 

Useful Review. A really good survey and summary.

IMHO, whole of biodiversity domain KOS UNachievable. But piecewise KOS achievable.

Good review of the gaps, especially lifecycle, validation, and the need for citations/ground truthing.

 

Comments

 

1. Biodiversity KOS is not  as simple as the data triangle analogy would imply.  Indeed Knowledge in one Biodiversity domain may be data in another context.

 

The survey

 

2. It would help interpretation and analysis to know more about the cohort who answered the survey.

 

Who? How were they recruited?  Representative of?  Information scientists, managers, biologists?  Without context and any details of the survey cohort, I could not contextualize figures or numbers, or make meaningful comparisons.

 

Impediments to adoption

 

3.  You cannot underplay the limitations arising from insufficient funding and technical support.

 

** A whole new level of bureaucracy, QC, and data entry are necessary to make this enterprise work.  Biodiversity is in no way resourced to do it.

 

4. Tools are immature / still in development

 

** This is still the field of the latest new method. Methodology and form remain immature and subject to change. Not yet a firm basis on which to build. Biologists have other more pressing domains to master (e.g. molecular methods and analysis) and so put their efforts elsewhere. This leaves the domain to "second tier and non biological experts" (myself included in this context).

 

Identification of Needs

 

5. Agree. The problem posed by the proliferation of vocabularies/ontologies is profound.

 

** Factual errors are a problem, but even more so are the problems that arise when a vocabulary or ontology is used outside of its original context (domain, basis,  ).  We all do it. But how do we not mangle meaning?

 

6. Change/variability in taxon concepts or phylogenetic trees should not prove a problem if they are applied appropriately...  but it is sometimes difficult to do this, and often the necessary information for correct application is omitted.

 

Current Status OF KOS

 

7. Does the multiplicity of renderings for knowledge pose problems in transition / translation?

 

8. Our (TRIN's) critique of SPM is overstated. Our systematic survey of past and current practices demonstrates that taxonomic knowledge "categorisations" do not generalize across taxon disciplines and applications (works). Not only do vocabularies vary with taxa, but so do the relevant knowledge/facts. However, we think SPM and other such "taxon profiles" can be unified by a simple crosswalking tool (and a lot of grunt work). This is where we are working now and then (collaborate) with Cyndy Parr.

 

 

Bob Morris 1323 days ago