Showing posts with label GenBank. Show all posts

Tuesday, August 19, 2025

Towards minimizing second-generation mis-identification of Blastocystis - continued

Hi all,

I hope you're well!

A great deal of Blastocystis research involves surveys and increasing our insight into Blastocystis epidemiology. We do this in order to understand what role this organism plays in health and disease, how we get it, where we get it from, etc.

Central to such research is our ability to 'speak together' - to have a standardised language - standardised systems. When we analyse Blastocystis DNA sequences obtained from the stool of human individuals or from the environment (including water and faeces from animals), we need to be able to compare them with other sequences to find out whether they are different or not. That's why we developed the subtype system back in 2007, a terminology that is still in use. However, it is becoming increasingly difficult to 'rely' on Blastocystis reference data because of an increasing amount of misinformation in the NCBI Database, a resource that holds genetic information on the many forms that life has taken, and which is probably used by tens of thousands of people across the globe for reference.

Last year, we published "Towards minimizing second-generation mis-identification of Blastocystis" in Trends in Parasitology, with a view to making our colleagues aware of DNA sequences in the NCBI Database wrongly annotated as 'Blastocystis'. There you will find a collection of sequences that are not Blastocystis even though they are labelled as such. I hope you think it's informative and helpful.

Since then, however, more sequences have come to my attention that are named 'Blastocystis' but that are not (they are sequences of yeast):

There are 16 sequences from Yunnan, China: PQ817670-PQ817679 and PV363977-PV363982. 

Moreover, I believe that these four sequences from Korea are also yeast: OR447548-OR447551. 

There may be other recent ones in GenBank that I haven't noticed yet, so watch out.
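If you keep your own local list of reference accessions, the ranges above can be expanded and used as an exclusion list. The following is only a minimal sketch: the function names `expand_accession_range` and `drop_flagged` are my own for illustration, and the code assumes accessions are written as a letter prefix followed by digits, optionally with a version suffix such as `.1`.

```python
import re

def expand_accession_range(spec):
    """Expand a range such as 'PQ817670-PQ817679' into individual accessions."""
    start, end = spec.split("-")
    m_start = re.fullmatch(r"([A-Z]+)(\d+)", start)
    m_end = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if not m_start or not m_end or m_start.group(1) != m_end.group(1):
        raise ValueError(f"Cannot expand accession range: {spec!r}")
    prefix = m_start.group(1)
    width = len(m_start.group(2))  # keep any zero padding in the numeric part
    first, last = int(m_start.group(2)), int(m_end.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]

# Accession ranges listed in this post as mislabelled 'Blastocystis' (yeast)
FLAGGED_RANGES = ["PQ817670-PQ817679", "PV363977-PV363982", "OR447548-OR447551"]
FLAGGED = {acc for spec in FLAGGED_RANGES for acc in expand_accession_range(spec)}

def drop_flagged(accessions):
    """Return the accessions that are not flagged, ignoring a version suffix like '.1'."""
    return [a for a in accessions if a.split(".")[0] not in FLAGGED]
```

The three ranges expand to 20 accessions in total (10 + 6 from Yunnan and 4 from Korea). The list only covers the sequences mentioned in this post, so it is a starting point and not a complete filter.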

Therefore: For sequence comparisons, including sequence alignment and phylogenetic analyses, I highly recommend using the collection of reference strains collated by Prof Graham Clark available here. I consider these my 'go-to collection', and they can be considered curated. It's the best database we have at the moment for identifying Blastocystis at subtype level.

If you're in doubt, you're always welcome to contact me.

 

 

Wednesday, May 2, 2012

Blastocystis Sequence Typing Home Page

Last year, we launched the Blastocystis Sequence Typing Home Page, which is a publicly accessible resource including two major facilities: 1) A sequence database and 2) An isolate database.
The databases cover both SSU-rDNA data and Multilocus Sequence Typing (MLST) data. For those interested in MLST, please visit this paper. The rest of this post will be about SSU-rDNA sequences.

The database has a BLAST function. Barcoding sequences (i.e. sequences which include the 500 5'-most bases in the SSU-rDNA) can be submitted individually or in bulk, and the output file will include information on subtype (ST) and allele. The number of alleles in ST3 is huge (currently n=38) compared to other subtypes, for which only 2-3 alleles have been identified (e.g. ST8). In case a sequence is submitted that is not similar to an allele already present in the database, I suggest that you do an individual sequence query, which enables the generation of an alignment, which will show you the polymorphism(s). In case a new allele is identified, I suggest that we submit this new allele to the sequence database.
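To illustrate what "showing the polymorphism(s)" in an alignment amounts to, here is a minimal sketch that lists the columns where an aligned query differs from an aligned reference allele. It is not the code behind the database; the function name `list_polymorphisms` is hypothetical, and it assumes the two sequences have already been aligned to equal length with '-' for gaps.

```python
def list_polymorphisms(query, reference):
    """Return (column, reference_base, query_base) for every mismatching column.

    Both sequences must already be aligned (same length, '-' marks a gap).
    Columns are 1-based positions in the alignment, not in either raw sequence.
    """
    if len(query) != len(reference):
        raise ValueError("Sequences must be aligned to the same length")
    diffs = []
    for col, (q, r) in enumerate(zip(query.upper(), reference.upper()), start=1):
        if q != r:
            diffs.append((col, r, q))
    return diffs
```

For example, comparing the toy strings "ACGTA" (query) and "ACCTA" (reference) returns a single difference at column 3. A query with no differences to any known allele matches that allele; one or more differences to the closest allele may point to a new allele, which is when I would like to hear from you.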
We strongly encourage using this BLAST feature for quick and standardised subtype and allele identification, and we equally encourage submitting isolate data, i.e. barcode sequences with provenance data (data on host, symptoms, geographical origin, etc.); again, this can be done by contacting the curator (me). Please look up the site for more information.

Our goal is to produce a database which accommodates large sets of data that can be submitted to scrutiny by everyone. The isolate database currently holds almost 700 isolates with 118 unique alleles - I hope this can be expanded much, much more. Also, data extracts can be done at all times, and below is a random example of an extract from human and non-human data from France downloaded from GenBank:
The colours indicate different alleles in different hosts (see legend to the right). A file with all alleles in fasta format is available here. You can paste them into the search field here for a total list of alleles currently in the database; try clicking on a couple to familiarise yourself with the system... One of the things that we can see here is that alleles 34, 36, 37 (ST3) and allele 4 (ST1) are the most common alleles in humans in France. It may seem a bit confusing to speak of both subtypes AND alleles. However, alleles are a good proxy for MLST data, and hence, looking at alleles is useful, e.g. in terms of transmission studies.
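If you export isolate data and want to do a similar tally yourself, counting alleles per host is straightforward. The sketch below is an assumption-laden illustration: the field names (`host`, `subtype`, `allele`) and the function `tally_alleles` are hypothetical and do not reflect the database's actual export format, and the records are invented placeholders, not real frequencies.

```python
from collections import Counter

def tally_alleles(records, host):
    """Count (subtype, allele) combinations among the records from one host."""
    return Counter(
        (r["subtype"], r["allele"]) for r in records if r["host"] == host
    )

# Invented placeholder records, for illustration only (not real data)
records = [
    {"host": "human", "subtype": "ST3", "allele": 34},
    {"host": "human", "subtype": "ST3", "allele": 34},
    {"host": "human", "subtype": "ST1", "allele": 4},
    {"host": "non-human", "subtype": "ST3", "allele": 36},
]

human_counts = tally_alleles(records, "human")
```

Calling `human_counts.most_common()` then lists the alleles from the most to the least frequent in that host group.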

There are many other ways of extracting and visualising data from the isolate database. For more information on barcoding, subtypes, alleles, and the databases, please do not hesitate to contact me. I emphasise that the database only works with sequences that include the barcode region; multiple SSU-rDNA targets have been used for subtyping, but because this database is based on barcode data, we recommend that subtyping be done by barcoding (see references).

Useful literature:

Stensvold, C., Alfellani, M., & Clark, C. (2012). Levels of genetic diversity vary dramatically between Blastocystis subtypes. Infection, Genetics and Evolution, 12 (2), 263-273 DOI: 10.1016/j.meegid.2011.11.002

Scicluna SM, Tawari B, & Clark CG (2006). DNA barcoding of Blastocystis. Protist, 157 (1), 77-85 PMID: 16431158