
Who is providing beetle data?

Hannu Saarenmaa
2120 days ago

Peer pressure is one of the most effective methods to persuade someone to do something, like publishing data.  At the EEA we used to publish maps where some countries were blank because they had not delivered their data.  When the minister saw that, things started to happen.

Before a major bird data provider joined GBIF they wanted to know who else is providing bird data.  When they realised that in all our neighbour countries bird data was already shared through GBIF, they also wanted to join.

Now I would like to make a similar case with beetles.  How can I get statistics of which countries and databases are already providing beetle data?

I think the nodes have agreed to move more towards "demand-driven" mobilisation of data.  That would mean filling gaps so that the data forms more comprehensive, and hence more usable, coverage.  I need some instruments to show where those gaps are.

Thank you, Hannu

Andre Heughebaert
2120 days ago

Hi Hannu,

Just type "Beetles" in the search box of the GBIF data portal and you get this list:

  • Beetles (LSM), from GBIF-Sweden
  • Wood-living beetles 2009, from Norwegian Institute for Nature Research
  • Aquatic Coleoptera Conservation Trust - SNH site condition monitoring of SSSIs for water beetles..., from UK National Biodiversity Network
  • Longhorn and buprestid beetles (Coleoptera: Cerambycidae, Buprestidae) of the forest-steppe “Bielinek”..., from Forest Research Institute, Department of Natural Forests
  • Collection of saproxylic and xylobiont Beetles, from BeBIF Provider
  • Water Beetles of Ireland, from National Biodiversity Data Centre
  • Saproxylic Beetles of the Tatra Mts., from University of Warsaw, Dept. of Ecology

Obviously not complete, but it is a starting point to identify who provides what.

But, I guess you did that already.

André

Steve Wilkinson
2120 days ago

Hi Hannu
I think this is a very good point. Andre's approach above only touches the surface. Eg. for the UK see http://data.nbn.org.uk/datasetInfo/customDatasetList.jsp?grpType=1&sgl1Key=NHMSYS0000079983&sgl2Key=NHMSYS0000080001 which is a list of all the datasets held by us that contain data relating to beetles. Not all of these are exposed to GBIF but many will be.

The fundamental point here is how much the central GBIF portal does versus the ideas emerging around forming a local cache around a specific use. I sort of feel that the central portal could go a little further, and that the query you have raised is quite pertinent - especially in terms of, as you say, demand-driven mobilisation. Now that we have at least the beginning of a taxonomic backbone - is this the sort of thing the central portal could do?

Steve

Angela Suarez-Mayorga
2119 days ago

What Hannu points out is really a challenge that the GBIF central portal should address, and also the advanced NPT for local networks. That means that searches must be powerful enough to allow different users - especially decision or policy makers - to understand what information is provided by the initiative, and to see it in different formats (like statistics). See for example the CRIA developments for collections in Sao Paulo state.

CRIA also has a very nice example of how peer pressure can influence data publication: they finished the List of the Plants of Brazil in less than 2 years - and that is huge! - mainly because they implemented a system in which it was possible for every participant to see what their colleagues had done in regard to the work they had said they were going to do. So, everyone worked on time.

Tim Robertson
2119 days ago

Hi all,

The GBIF Data Portal provides the ability to list the data publishers for a given occurrence search.  If you create a query and select "specify Data Publishers to be included in search", you will retrieve a list of data publishers.  Similarly, one can include a "Host Country" filter to list only the data publishers within a given country.  For example, a search for a list of the publishers in Sweden sharing Carabidae Latreille, 1802 (ground beetles) can be run [1].  Please note, however, that this functionality was developed in the portal when data volumes were far smaller, and for queries with large amounts of data the browser and portal will "time out" and not return the results to the user.  Unfortunately, a query of Coleoptera Linnaeus, 1758 falls into this category.

While the portal has limitations for complex or large queries, the index (MySQL database) behind the portal often holds the information to answer the question.  One option would be to request privileged access [2] to a copy of the GBIF data portal index, which would allow you to load it into your own MySQL server and run these kinds of analyses on your own.  For a GBIF Node considering building a case of peer pressure this might be a worthwhile endeavor.  It would require basic databasing skills and access to a reasonably powerful machine (16GB memory and 300GB disk space minimum).  Hosted cloud computing servers such as Amazon EC2 [3] might be worth exploring, as you may not need the server for more than a few days and the cost would be minimal.  In exceptional circumstances the GBIF Secretariat staff will do our best to accommodate ad hoc queries where the portal cannot provide the answer.
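To give a feel for the kind of analysis a local copy of the index allows, here is a minimal sketch of a "publishers per host country for a taxon" aggregate. The table and column names are illustrative only, not the real GBIF index schema, and SQLite stands in for MySQL for portability:

```python
# Sketch of an aggregate query one could run on a local copy of the portal
# index. Schema (table "occurrence", columns publisher/host_country/
# taxon_order) is hypothetical; the real index is MySQL, SQLite is used
# here only so the sketch is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE occurrence (
    publisher     TEXT,
    host_country  TEXT,
    taxon_order   TEXT
);
INSERT INTO occurrence VALUES
    ('Museum A',    'SE', 'Coleoptera'),
    ('Museum A',    'SE', 'Coleoptera'),
    ('Institute B', 'NO', 'Coleoptera'),
    ('Museum C',    'SE', 'Lepidoptera');
""")

# Which publishers, in which host countries, share beetle records, and how many?
rows = conn.execute("""
    SELECT host_country, publisher, COUNT(*) AS n
    FROM occurrence
    WHERE taxon_order = 'Coleoptera'
    GROUP BY host_country, publisher
    ORDER BY host_country, n DESC
""").fetchall()

for country, publisher, n in rows:
    print(country, publisher, n)
```

The same GROUP BY pattern, run against the real index, would produce exactly the gap-spotting statistics Hannu is after.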

In 2012 we anticipate expanding the portal functionality and services so that these kinds of reports can be generated more easily by users.  With the new processing infrastructure we employed to improve the portal rollover process in 2011, we have the ability to run ad hoc queries on the index without locking the underlying databases.  Would training, and the ability for node technical staff to run ad hoc queries (e.g. SQL) against the index, be of interest to the community?  This could be explored to help support national-level reporting.

If the request for metrics on publishers by country sharing Coleoptera was a genuine request, the results are available for download [4].

Best wishes,

Tim

Andre Heughebaert
2119 days ago

Hi Tim, all

Would it be (easily) possible to have a Country/Host matrix like the one provided for all occurrences, but limited to a certain taxonomic group - beetles in this case?

This will answer both questions:

  • where are the geographic gaps for this group?
  • who is providing data for this group?

Country level is a very interesting scale at which to answer those questions.

Ideally, the taxonomic group of this query would be defined by a checklist.

Thanks,

Andre

Markus Döring
2119 days ago

Andre,

An interesting idea, to use a checklist to define a disparate taxonomic group. If it's useful, we could surely preprocess such a matrix for every registered checklist in the near future. Thinking about the details: would we only consider terminal taxa in such a list to match with the occurrences? For example, if 5 species of Abies are listed, would we only count those 5 species and no other Abies or Pinaceae species unless given? But if the list "terminated" with the Abies genus, or even the Pinaceae family, would we count all occurrences for the genus/family?

A checklist could also be used to define other common disparate, often paraphyletic groups like fishes and then allow searches or statistics based on that list.
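The terminal-taxa rule above can be sketched in a few lines. This is only one possible interpretation of the matching semantics being discussed, with a toy classification; the names and ranks are illustrative:

```python
# Sketch of the "terminal taxa" matching rule: an occurrence matches the
# checklist if its species is listed explicitly, or if the list "terminates"
# at one of its higher taxa (genus or family here). Toy data, not GBIF's
# backbone taxonomy.

# species -> (genus, family)
CLASSIFICATION = {
    "Abies alba":       ("Abies", "Pinaceae"),
    "Abies grandis":    ("Abies", "Pinaceae"),
    "Pinus sylvestris": ("Pinus", "Pinaceae"),
}

def matches(species, checklist):
    """True if the occurrence's species is covered by the checklist."""
    genus, family = CLASSIFICATION[species]
    # A species entry matches only itself; a genus or family entry
    # (the list terminating at that rank) matches all descendants.
    return species in checklist or genus in checklist or family in checklist

# A list naming individual Abies species covers only those species:
assert matches("Abies alba", {"Abies alba"})
assert not matches("Abies grandis", {"Abies alba"})

# A list terminating at the genus covers every Abies species:
assert matches("Abies grandis", {"Abies"})
assert not matches("Pinus sylvestris", {"Abies"})

# A list terminating at the family covers all Pinaceae:
assert matches("Pinus sylvestris", {"Pinaceae"})
```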

Tim Robertson
2119 days ago

Hi Andre,

Thank you for the suggestion.  

This is exactly the kind of suggestion we need to help steer future portal developments.  Logically, through the user interface, do you think it should be displayed on the "taxon" / species page?  E.g. any taxon overview would have a block containing the matrix.  Or do you think it needs a more complex interface that allows taxonomic browsing and country selection?  If you could elaborate on how you feel the information needs to be surfaced for this use, we will create the issue in the tracking system that we are just beginning to migrate to, to ensure the idea is not lost.

Markus Döring
2119 days ago

For a checklist based matrix it would probably be linked from the dataset (statistics) page?

Vishwas Chavan
2119 days ago

Dear Hannu,

You are absolutely right that peer pressure, together with incentives, will encourage individuals and institutions to publish data. In turn it will also help in discovering the datasets of interest in the first place, with ease and efficiency.

One possible mechanism to keep up the peer pressure is to aim at publishing a 'Data Paper'. If a fellow scientist/researcher has published a 'Data Paper' in a scholarly journal, it will certainly add pressure on, and/or encourage, me to publish the datasets that I have.

Incidentally, the first IPT-derived Data Paper was published in ZooKeys. See the related news item: http://www.gbif.org/communications/news-and-events/showsingle/article/first-database-derived-data-paper-published-in-journal/. The Data Paper itself is accessible at http://www.pensoft.net/journals/zookeys/article/2002/abstract/literature-based-species-occurrence-data-of-birds-of-northeast-india. It links to the metadata published through the GBIF network: http://ibif.gov.in:8080/ipt/resource.do?r=BNHS-NEW.

The 'Data Paper' will not only create much-needed peer pressure and provide much-deserved incentives, but will also indirectly help in enhancing the quality of the data itself. For instance, the first 'Data Paper' underwent a rigorous review process lasting nearly three months before being accepted for publication in ZooKeys. The reviewers not only addressed the metadata description, but also suggested improvements to the dataset itself, adding a layer of quality control to the data available to users of GBIF web services. This experience has revealed an additional benefit: the critical review process can help to enhance the quality of the biodiversity data published through the GBIF network, meeting a major objective as the volume of data shared continues to expand.

Six of the PenSoft titles, namely ZooKeys, PhytoKeys, MycoKeys, NeoBiota, Nature Conservation and BioRisks are accepting 'Data Paper' submissions. Guidelines for authoring 'Data Paper' are available at http://www.pensoft.net/J_FILES/Pensoft_Data_Publishing_Policies_and_Guidelines.pdf.

I would encourage the custodians and owners who are publishing datasets through the GBIF network to author enriched metadata and submit a 'Data Paper' to one of the six journals. For the first few papers PenSoft will not charge a processing fee.

I would be more than willing to help/assist anyone in writing their first ever 'Data Paper'.

Vishwas

Nicolas Noé
2119 days ago

About Tim's question (what's needed for implementation):

My first feeling there is that from a database/technical point of view, the demands that will come from such statistics will probably be very varied (following a taxonomic backbones or another, at a broad or detailed level, or according to countries, or institution, or provider, or ...).

Given that, I think the most pragmatic approach is a dual one, as you more or less suggested:

1) For simple to moderately complex cases, it would be great to have this accessible in a few clicks in a new portal iteration soon.

2) For very complex cases, a manual/advanced mode involving creating SQL queries and running them on a copy of the database. Of course some capacity will be needed, but that's acceptable if the demand is complex/important. Cooperation/help within the GBIF network could also be used here.

I think combining these 2 approaches will give us, at least, a very good starting point, without asking the impossible of the IT team.

For IPT development, this kind of feature should be noted now and thought about deeply later, when designing a very detailed architecture for the advanced version. It's a little too early now, and fixing such things prematurely does more harm than good. And maybe by then the need will already be fulfilled by the global tools.

My 2 cents,

Nico

Andre Heughebaert
2119 days ago

Markus, Tim, this is how I see it:

A user prepares a checklist of families, genera, species or any other taxon ranks, in DwC-A format. With this, they can query the Data Portal to get:

  • Map of all occurrences matching the checklist
  • List of Data Providers (#occurrences provided)
  • Matrix of Country/HostCountry
  • and basically all you get out of the current search

The checklist does not have to be a 'published' checklist; it is rather a convenient way to query the data portal. Therefore, I do not see the need for a complicated taxonomy browser interface, as long as the query could start with my own checklist. Adding geo and time filters to this new checklist-based query would be cool.
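The Country/HostCountry matrix restricted to a checklist could be sketched as follows. The field names and records are illustrative, not the actual DwC-A terms or portal schema:

```python
# Sketch of a Country/HostCountry matrix limited to records whose species
# appears on a user-supplied checklist. Species names, country codes and the
# record layout are made up for illustration.
from collections import Counter

checklist = {"Harmonia axyridis", "Dytiscus marginalis"}

occurrences = [
    # (species, country of the record, host country of the publisher)
    ("Harmonia axyridis",   "BE", "BE"),
    ("Harmonia axyridis",   "BE", "GB"),
    ("Dytiscus marginalis", "IE", "IE"),
    ("Pieris brassicae",    "BE", "BE"),   # not on the checklist: ignored
]

# Count records per (country, host country) cell, filtered by the checklist.
matrix = Counter(
    (country, host)
    for species, country, host in occurrences
    if species in checklist
)

for (country, host), n in sorted(matrix.items()):
    print(f"{country} records hosted by {host}: {n}")
```

Empty cells in such a matrix are exactly the geographic gaps Hannu wants to surface; the non-empty cells answer "who is providing data for this group".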

As a use case, I would start with the Belgian Invasive Species, but the same mechanism will help Hannu to answer his Beetles question.

Results don't have to be immediate; I'd be happy with an email announcement of the results.

André

Markus Döring
2118 days ago

André, this sounds like a very good idea to me. In the current ECAT portal prototype we had a similar idea that never went live, though. It allowed users to upload simple personal lists of names - basically a scientific name with some classification context, to be able to deal with homonyms and spelling errors properly. These simple lists would not be registered, but would be persistent and could be modified at any time by a user (login required). That way we could include those lists in all sorts of computation-intensive processing, and it would be simple to extend them with the kinds of maps and information you listed. For example, we were thinking of some tools to "enrich" these lists with classifications or identifiers from a specific authoritative checklist (e.g. ITIS), or to add vernacular names of a certain language. The results would be available as a DwC-A for download.

Do you think this is something worthwhile exploring further?

Andre Heughebaert
2118 days ago

Markus,

Yes, such persistent (check)lists that users can re-use would be great. I'll support that idea.

Andre 

Hannu Saarenmaa
2118 days ago

Many thanks to you all for the insights into this question.  The beetle provider listing generated by Tim is pretty interesting, and I may be able to use it.

I read that producing these statistics with the current data portal may require heavy processing.  How about building another portal - a metadata portal?  This would be based on OLAP technology (On-Line Analytical Processing).  An OLAP cube will have the numbers ready along its n dimensions.  The dimensions in our case would be something like:

  • Continent
  • Country
  • Publisher
  • Dataset
  • Georeferenced/not
  • Specimen/observation
  • Kingdom
  • Class
  • Order
  • Family
  • Open access / special terms
  • Number of downloads of data
  • Primary data published / just metadata available, but publishable on-demand

Maybe that is too many dimensions to start with, but I hope you get the idea.  One data portal based on relational database model and loaded with primary data may not be able to do all the things we need. Scientists need data, but managers need metadata.

This last dimension could be linked to a request form to mobilise unpublished datasets on-demand.
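The cube idea can be illustrated with a toy pre-aggregation over a few of the proposed dimensions. The records and dimension names here are made up; a real implementation would use a proper OLAP engine rather than this in-memory sketch:

```python
# Toy sketch of the OLAP idea: precompute counts for every combination of
# kept/collapsed dimensions, so gap reports become simple lookups instead of
# heavy queries. Only three illustrative dimensions are used here.
from collections import Counter
from itertools import product

records = [
    {"country": "FI", "family": "Carabidae",  "georeferenced": True},
    {"country": "FI", "family": "Carabidae",  "georeferenced": False},
    {"country": "SE", "family": "Dytiscidae", "georeferenced": True},
]

DIMS = ("country", "family", "georeferenced")

# The "cube": a key slot of None means that dimension is rolled up.
cube = Counter()
for rec in records:
    for mask in product([True, False], repeat=len(DIMS)):
        key = tuple(rec[d] if keep else None for d, keep in zip(DIMS, mask))
        cube[key] += 1

# Roll-ups are now ready-made lookups:
assert cube[("FI", None, None)] == 2          # all records from Finland
assert cube[(None, "Carabidae", None)] == 2   # all Carabidae records
assert cube[("FI", "Carabidae", True)] == 1   # georeferenced Finnish Carabidae
```

With 13 dimensions the number of cells grows quickly, which is why OLAP engines aggregate at load time; but the lookup pattern is the same.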

Claude Nozeres
2116 days ago

The idea of persistent checklists is a great one. But I wonder if this partly misses the original question regarding beetles?

RE: Hannu. While queries or checklists by dataset, taxa, or even time & geography are very useful, is it reasonable to do thematic lists, say, 'soil meiofaunal organisms' or 'marine plankton'? Currently, I would think that the query would have to be built manually using related information, say, by selecting records for species known to be small and in soil, or marine and swimming (although I believe WoRMS or OBIS are working on thematic features like this for marine species).

Would such a thematically generated checklist then be useful for annotating or referring back to the records? I.e., if there are species that are referenced in saved checklists like benthic, pelagic, soil, or parasites, these could be used to help build another checklist - say, combine epi- and endo-fauna, or combine South American benthic with North American benthic.

Or might it be possible to use linked resources that may already tag the taxa by themes like habitat, conservation status, etc., perhaps on Wikipedia, IUCN, EOL, WoRMS, or ITIS? One could then build a query based on features not found in the DwC records on GBIF. Just wondering!

Markus Döring
2116 days ago

I hope those thematic checklists would help, Claude.

At the same time we are also trying to include more species properties found in other sources, like the IUCN Red List or Wikipedia, as you mentioned. For good queries, though, it's important that we have a strictly controlled vocabulary. Habitats are a good example. As a start we are trying to integrate the IUCN threat status, isExtinct and isMarine/Terrestrial as a very simplistic habitat substitute in our backbone taxonomy.

As all occurrence records are tied to this taxonomy, we can then make use of such species properties in queries. Due to the hierarchical nature of a taxonomy, we can even propagate properties like isExtinct or isMarine from genus or higher taxa down to all included species. IRMNG is a good source for such information. For example, we don't directly know much about Trigonosaurus pricei, but its family is marked as extinct in IRMNG.
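The downward propagation Markus describes can be sketched as a walk up the classification until some ancestor asserts the property. The tiny parent map and flags below are illustrative (following the Trigonosaurus/IRMNG example), not real backbone data:

```python
# Sketch of propagating a property like isExtinct from a higher taxon down
# to all included species: a species inherits the first value asserted by
# any ancestor. Toy taxonomy and flags for illustration only.

# child -> parent links of a toy classification
PARENT = {
    "Trigonosaurus pricei": "Titanosauridae",
    "Titanosauridae": "Sauropoda",
}

# Properties asserted directly on some taxa (e.g. taken from IRMNG)
DIRECT = {
    "Titanosauridae": {"isExtinct": True},
}

def inherited(taxon, prop):
    """Walk up the classification until some ancestor asserts the property."""
    while taxon is not None:
        value = DIRECT.get(taxon, {}).get(prop)
        if value is not None:
            return value
        taxon = PARENT.get(taxon)
    return None  # unknown anywhere in the lineage

# The species itself has no flag, but inherits isExtinct from its family:
assert inherited("Trigonosaurus pricei", "isExtinct") is True
# Above the asserting taxon, nothing is known:
assert inherited("Sauropoda", "isExtinct") is None
```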

Aaike De Wever
2115 days ago

I am glad to learn about the (future) advanced search capabilities and the suggestion on data processing using the Amazon services. I don't have any experience with the latter, but in case anyone goes down this route, I wonder if it would make sense to make the Amazon Machine Image available, in order to reduce the work for others wanting to do this too?

In addition, I really appreciate the integration of the kinds of properties mentioned by Markus, especially if there were an isFreshwater flag as well...