What did we learn from the novel coronavirus genomes from India?

Vinod Scaria, Bani Jolly
15th May 2020


From a possible isolated outbreak in a seafood market in Wuhan, China, coronavirus disease (COVID-19) has now emerged as a global pandemic, affecting over four million people and killing over 2 lakh (200,000) people worldwide. The causative organism is a small virus, so small that over 400,000 of its kind can fit on the tip of a needle. This new strain of coronavirus, named novel coronavirus 2019 or 2019-nCoV, has an RNA genome of around 30,000 nucleotide bases and belongs to the betacoronavirus genus.

The first genome sequence of the novel coronavirus, isolated from a man working at the seafood market in Wuhan, was made available in the public domain by a consortium of researchers in China and today serves as the reference point for understanding the virus and its evolution. Like all organisms, the coronavirus evolves through the accumulation of genetic mutations. Compared with the influenza viruses that cause the common flu, the novel coronavirus mutates at a much slower pace. It is estimated that the virus accumulates one mutation approximately every 15 days. As the virus replicates and transmits, mutations accumulate in its genome, forming different evolutionary groups or ‘clades’.
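
The rate quoted above lends itself to a rough back-of-the-envelope calculation. The sketch below, in Python, assumes a perfectly steady rate of one mutation every 15 days, which is a simplification since real viral evolution is noisier. The function name and the dates in the example are purely illustrative and are not taken from any dataset.

```python
from datetime import date

# Approximate rate quoted in the article: one mutation every 15 days.
DAYS_PER_MUTATION = 15


def expected_mutations(start: date, end: date) -> float:
    """Rough number of mutations expected to accumulate between two dates,
    assuming a constant rate (a simplification)."""
    elapsed_days = (end - start).days
    if elapsed_days < 0:
        raise ValueError("end date must not be earlier than start date")
    return elapsed_days / DAYS_PER_MUTATION


if __name__ == "__main__":
    # Illustrative dates only: they are 135 days apart.
    print(expected_mutations(date(2020, 1, 1), date(2020, 5, 15)))  # 9.0
```

Under this assumption, two isolates sampled about four and a half months apart would be expected to differ from each other by only a handful of mutations along a direct line of descent.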

Sequencing the genome of the virus can provide a view of the genetic mutations in a particular strain and how it compares with the rest of the strains worldwide. As the virus is transmitted from individual to individual and spreads across the world, the mutations accumulating in its genome divide the virus isolates into closely related evolutionary groups or ‘clades’. Based on genomic data obtained from GISAID, a database for sharing genomic data, Nextstrain, a research network for comparison of viral genomes, has broadly divided the genomes into 10 clades: A1a, A2, A2a, A3, A6, A7, B, B1, B2 and B4. Clades A1a, A2, A2a, A3, A6 and A7 collectively form the ‘A’ supergroup or ‘superclade’ of the novel coronavirus, also known as the ‘European clade’ since the sequences falling under this type originated largely from European nations. Similarly, the four remaining clades form the second superclade ‘B’, also termed the ‘East Asian clade’ based on its origins.
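
The grouping described above can be written down as a small lookup table. This is only a sketch in Python: the clade names and their superclades are taken from the paragraph above, while the dictionary layout and the function name `superclade_of` are illustrative choices and not part of Nextstrain.

```python
# Clade names and groupings as listed in the article.
SUPERCLADES = {
    "A": {"A1a", "A2", "A2a", "A3", "A6", "A7"},  # the 'European clade'
    "B": {"B", "B1", "B2", "B4"},                 # the 'East Asian clade'
}


def superclade_of(clade: str) -> str:
    """Return the superclade ('A' or 'B') that a clade belongs to."""
    for superclade, members in SUPERCLADES.items():
        if clade in members:
            return superclade
    raise KeyError(f"unknown clade: {clade!r}")


if __name__ == "__main__":
    for clade in ("A2a", "A3", "B4"):
        print(clade, "->", superclade_of(clade))
```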

Classification of the clades early in the epidemic could help track where the virus has been and roughly suggest the origin of the infection. The sequence of the viral genome thus gives researchers an opportunity to understand how the virus evolves and, more importantly, how it spread across the world and, in the case of outbreaks, often from individual to individual.

The initial genomes of novel coronavirus isolates from India were obtained from two patients who had travelled from Wuhan, China to Kerala. The genomes were sequenced and deposited by the National Institute of Virology, Pune. To date, over 200 genomes of novel coronavirus isolates from India have been deposited in public databases globally. These include isolates from a variety of government agencies, including the National Institute of Virology, the National Centre for Disease Control, the National Institute of Biomedical Genomics, the National Institute of Mental Health and Neuro-Sciences, and others. The majority of the genomes have been made available through a collaborative effort between the National Centre for Disease Control and the CSIR Institute of Genomics and Integrative Biology. Particularly noteworthy is the Gujarat Biotechnology Research Centre, a state-sponsored research organisation which has deposited over 50 genome sequences of novel coronavirus isolates collected from across Gujarat.

Novel coronavirus genomes can be compared with one another based on the genetic mutations they carry. This comparison can be used to construct what is scientifically known as a phylogenetic tree: a family tree of the virus that depicts how the different virus genome sequences are related to each other. Clades can thus be identified on the tree as clusters that share a common ancestor and descend from the same branch of the tree. Phylogenetic analysis has shown that the Indian coronavirus isolates largely cluster into five clades: A1a, A2a, A3, B and B4 (Figure 1).
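
The idea of comparing genomes by the mutations they carry can be illustrated with a toy example. The Python sketch below represents each genome as the set of mutations it has relative to the reference, and counts the mutations that two genomes do not share. It is a deliberate simplification: real phylogenetic software works from sequence alignments and statistical models, and the isolate names and mutation labels (`m1`, `m2`, ...) are made up for illustration.

```python
from itertools import combinations


def mutation_distance(a: set, b: set) -> int:
    """Number of mutations present in one genome but not the other."""
    return len(a ^ b)  # symmetric difference


def pairwise_distances(genomes: dict) -> dict:
    """Distance for every unordered pair of genomes."""
    return {
        (x, y): mutation_distance(genomes[x], genomes[y])
        for x, y in combinations(sorted(genomes), 2)
    }


def closest_relative(name: str, genomes: dict) -> str:
    """Genome with the fewest unshared mutations (ties broken by name)."""
    others = [g for g in genomes if g != name]
    return min(
        others,
        key=lambda g: (mutation_distance(genomes[name], genomes[g]), g),
    )


if __name__ == "__main__":
    # Hypothetical isolates and mutation labels, for illustration only.
    toy_genomes = {
        "isolate_1": {"m1", "m2"},
        "isolate_2": {"m1", "m2", "m3"},
        "isolate_3": {"m4"},
        "isolate_4": {"m4", "m5"},
    }
    for pair, dist in pairwise_distances(toy_genomes).items():
        print(pair, dist)
    print(closest_relative("isolate_1", toy_genomes))  # isolate_2
```

In this toy data, isolates 1 and 2 share mutations that isolates 3 and 4 lack, so they would sit together on one branch of a tree, which is the intuition behind a clade.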


Figure 1. Phylogenetic clusters of the SARS-CoV-2 genomes from India, with the clades marked at the origin of the leaves (as on 14/05/2020). An updated, browsable version is available at http://bit.ly/c19phylovis

Most of the Indian genomes fall in the A superclade, the majority in the A2a and A1a clades and a few in the A3 clade. The A2a clade is one of the predominant clades globally. The genomes in the A3 clade, which was previously reported mostly from Iran, are from isolates collected in Ladakh. The initial genomes from Kerala fell into the B clade and are from individuals who had travelled from Wuhan. The recent addition of the B4 clade to the Indian cluster came largely through the sequencing efforts of the Gujarat Biotechnology Research Centre and the National Institute of Biomedical Genomics, from Gujarat and West Bengal respectively. The B4 clade is a sub-type of superclade B, with potential origins in either East Asia or Oceania. Sequences that belong to the B4 clade harbour two distinguishing mutations in their genomes. The first, L84S in the gene ORF8, is common to all clades of the B superclade. The other is S202N in the gene that encodes the nucleocapsid protein of the virus. In the phylogenetic tree of genomes from India, three genomes from Gujarat and one from West Bengal fall under this cluster.
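
The two marker mutations described above suggest a simple rule of thumb, sketched here in Python. It uses only the two mutations named in the paragraph and is not how clades are formally assigned, which relies on placement in the phylogenetic tree. The function name, the return labels, and the use of `N` as the label for the nucleocapsid gene are illustrative assumptions.

```python
# Marker mutations named in the article, as (gene, amino-acid change) pairs.
ORF8_L84S = ("ORF8", "L84S")  # common to clades of the B superclade
N_S202N = ("N", "S202N")      # nucleocapsid gene, distinguishes B4


def classify_by_markers(mutations: set) -> str:
    """Very rough label for a genome based only on the two marker mutations."""
    if ORF8_L84S in mutations and N_S202N in mutations:
        return "B4"
    if ORF8_L84S in mutations:
        return "B superclade (not B4 by these markers)"
    return "not assigned by these markers"


if __name__ == "__main__":
    # Hypothetical genome carrying both markers.
    print(classify_by_markers({("ORF8", "L84S"), ("N", "S202N")}))  # B4
```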

While a number of people have suggested that the virus clades differ in severity, many of these claims have not been adequately substantiated. Though some clades of the novel coronavirus are more prevalent in some locations than in others, there is insufficient data to draw conclusions about differences in the virulence and clinical outcomes of these clades. More genomes, and systematic tagging of clinical information to those genomes, would significantly improve our understanding in this direction.

The COVID-19 pandemic has highlighted the need to share genomic data on a global scale. It is heartening to note that over 25,000 coronavirus genomes have been made publicly available by researchers from over 70 countries, making this one of the best examples of Open Data initiatives taking shape across the globe. It also highlights the renewed interest in Open Source movements for developing better diagnostics and novel therapeutics, as rightly emphasised by Tedros Adhanom, Director General of the World Health Organisation, in his recent address to the world.


About the Authors
The authors are from the CSIR Institute of Genomics and Integrative Biology, Delhi. 
All opinions expressed are personal. 
Authors can be contacted at vinods@igib.in (Vinod Scaria) / bani.jolly@igib.in (Bani Jolly)

