Indian Genome Variation Consortium : Phase 1 |
|
|
In the first phase of validation, we determined the extent of
genetic diversity and heterogeneity prevalent in the Indian
population. For this purpose, populations were categorized by
geographical zone and linguistic category, and, wherever
available, at least two populations of contrasting size were
selected and prioritized from each category (See Population
details).
We identified 55 subpopulations and decided to collect, on
average, 40 samples from each population in the first phase.
Individual samples from 31 of these are also represented in our
discovery panel (See Composition of Discovery panel).
In addition, 8 large, 1 isolated and 2 special populations are
also included in this validation panel to maximize
representation of the people of India. The diversity information
generated from the first phase of validation will be used as a
criterion for further collection of samples to attain the
target of 15,000.
|
|
|
Identification of populations |
|
The project aims to provide an SNP database from well-defined
ethnic groups that have been chosen to represent the entire
spectrum of diversity within the Indian population. Considering
the population diversity (See Diversity of Indian Population),
two issues had to be addressed: first, defining a population
substructure that captures the entire genetic diversity; and
second, composing a small panel of samples for SNP discovery
that would ensure representation of SNPs from the entire Indian
population. For this purpose the project has been carried out in
two steps: discovery of SNPs on a small panel of 43 samples (See
Discovery panel), followed by estimation of their frequency in a
larger set of samples, which constitutes the validation panel
(See Description of Validation Panel).
This, we felt, would give an estimate of the genetic
heterogeneity that would help us in further substructuring of
the population.
|
|
|
Composition of the Discovery Panel |
With a view to discovering novel SNPs as well as determining the
presence and frequency of reported SNPs in the Indian
population, an initial panel was assembled comprising
representatives drawn from 43 different subpopulations (See Map
of Discovery Panel). This discovery panel included samples, both
tribal and non-tribal, from diverse geographical zones and
linguistic backgrounds to maximize novel SNP discovery.
Although 43 individuals per se do not represent the entire
Indian population, such a diverse set captures more
heterogeneity for SNP discovery than a set of samples from a
single subpopulation.
|
|
|
Composition of the validation panel |
|
The populations for the validation panel have been identified
based on the following criteria: geographical zone, linguistic
group, practice of endogamy, presence of minority communities
from different religious groups, and existence of populations of
different sizes. Four major linguistic lineages, namely
Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic, have
been considered. We also categorized the populations as small if
their size was <1 million and large if >10 million. This
strategy has been followed to ensure maximum coverage of the
Indian population as well as to capture the minor alleles in
large outbred populations.
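The size criterion above can be written down as a small rule. The following is only a minimal sketch: the two thresholds come from the text, while the function name and the "intermediate" label for sizes between 1 and 10 million are assumptions, since the text does not say how such populations were treated.

```python
SMALL_MAX = 1_000_000    # "small" if size is below 1 million (from the text)
LARGE_MIN = 10_000_000   # "large" if size is above 10 million (from the text)


def size_category(population_size: int) -> str:
    """Classify a population by census size.

    The 'intermediate' label is an assumption; the text only
    defines the 'small' and 'large' categories.
    """
    if population_size < 0:
        raise ValueError("population size cannot be negative")
    if population_size < SMALL_MAX:
        return "small"
    if population_size > LARGE_MIN:
        return "large"
    return "intermediate"


if __name__ == "__main__":
    for size in (400_000, 5_000_000, 25_000_000):
        print(size, size_category(size))
```

Keeping the thresholds as named constants makes it easy to see which parts of the rule are taken from the text and which are illustrative.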
|
|
|
Sample Collection |
The identification of populations as well as the collection of
samples has been carried out with the help of trained
anthropologists, social workers and community health workers, as
their participation is essential for establishing rapport with
the general public. Individuals fluent in the local language of
the populations concerned have also been consulted and actively
involved in the study, both to obtain complete and authentic
information from the donors and to help the donors better
understand the purpose of the investigation. Endogamy of the
populations has been established from extensive information on
marriage patterns, gathered through pedigrees and interviews
with family members of the donor, as well as from published
literature. A general template is used to obtain informed
consent from the donors of the samples; where the donor is
illiterate, a thumb impression is used. In addition, verbal
tape-recorded consent of the donors has been taken. It has been
ensured that the individuals are unrelated at least to the first
cousin level. All the institutes have participated in the
collection of samples, with three nodal centres, IGIB, CCMB and
IICB, which are connected to the other centres. IGIB, CDRI,
IMTECH and ITRC have ensured collection of samples from the
northern and central parts of India, IGIB and CCMB from the
western part, IICB from the eastern part and CCMB from the
southern part of the country.
|
|
|
Ethical Clearance |
|
Each institute has obtained prior ethical clearance from the
Institutional Bioethics Committee (IBC) for the collection of
samples, following the guidelines of the Indian Council of
Medical Research (ICMR)
(http://icmr.nic.in/ethical.pdf),
for the complete period of 5 years. A uniform, bar-coded,
detailed questionnaire has been developed, containing
information pertaining to ethnicity, family history of diseases
and other phenotypic traits of the sample donor. Prior to sample
collection, it is explained to the participants that the
personal identifiers in the questionnaire are confidential and
are not available to the researchers, and that the samples are
irretrievably coded. It is also explained to the volunteers that
the project aims at understanding the extent of variability and
diversity in different subpopulations and that the baseline data
generated in this study would be used for disease-specific
association studies. We also ensure that participation is
entirely voluntary and that no material promises are made to the
donors. No promise of a genetic test is made either.
|
|
|
Managing ethical issues |
Although the project will include no personal identifiers, each
sample is identifiable through a sample code as well as a
population code. There is a provision for the volunteers to
withdraw from the study at will. Naming the population
associated with a particular set of tag SNPs would allow a
better interpretation of biological significance in future
studies of association, population history and population
relatedness; it does, however, have important ethical and social
ramifications. To avoid any social backlash that could
destabilize the very fabric of Indian society, i.e. unity in
diversity, a decision was taken against disclosing the identity
of the populations. This is because the way a population is
labeled in this project and described in publications will have
implications for all members of the population, as all of them
(and all members of closely related populations) might be
affected by the interpretation and use of findings of future
studies. The samples collected from different populations are
bar-coded, with each population being given a specific code
revealing the linguistic affinity of the population, the
geographic zone to which the population belongs, and the type of
population, viz. large endogamous population, isolated
population or special population.
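The population code described above carries three pieces of information: linguistic affinity, geographic zone and population type. A minimal sketch of how such a code could be represented and checked is given below. The exact format is not stated in the text, so the `LINGUISTIC-ZONE-TYPE` layout, the abbreviations and all names in the code are hypothetical and used for illustration only.

```python
from dataclasses import dataclass

# Hypothetical abbreviations; the actual coding scheme is not given in the text.
LINGUISTIC = {
    "IE": "Indo-European",
    "DR": "Dravidian",
    "TB": "Tibeto-Burman",
    "AA": "Austro-Asiatic",
}
ZONES = {"N": "north", "C": "central", "W": "west", "E": "east", "S": "south"}
POP_TYPES = {
    "LP": "large endogamous population",
    "IP": "isolated population",
    "SP": "special population",
}


@dataclass(frozen=True)
class PopulationCode:
    linguistic: str
    zone: str
    pop_type: str

    def __str__(self) -> str:
        return f"{self.linguistic}-{self.zone}-{self.pop_type}"


def parse_population_code(code: str) -> PopulationCode:
    """Parse a code such as 'IE-N-LP' (assumed format) and validate each field."""
    parts = code.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three '-'-separated fields, got {code!r}")
    linguistic, zone, pop_type = parts
    if linguistic not in LINGUISTIC:
        raise ValueError(f"unknown linguistic group: {linguistic!r}")
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    if pop_type not in POP_TYPES:
        raise ValueError(f"unknown population type: {pop_type!r}")
    return PopulationCode(linguistic, zone, pop_type)


if __name__ == "__main__":
    code = parse_population_code("IE-N-LP")
    print(code, "->", LINGUISTIC[code.linguistic], ZONES[code.zone], POP_TYPES[code.pop_type])
```

A code of this kind identifies the category of a population without revealing its name, which is the property the paragraph above relies on.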
|
|
|
Strategy and methods for marker discovery and validation |
In the first phase of the project, screening for novel SNPs is
carried out in 75 genes on the discovery panel of 43 samples.
For this purpose, amplicons were generated for exonic regions
spanning nearly the entire gene. This not only provides data on
the SNPs shared among the different Indian subpopulations and
between Indian and other world populations but also reveals
population-specific indigenous SNPs. In addition, the 43 samples
provide data on 86 chromosomes, thus enabling the identification
of SNPs with an overall minor allele frequency (MAF) >0.05 to be
used for further validation.
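The link between 86 chromosomes and the MAF threshold of 0.05 can be illustrated with a short calculation. The sketch below assumes that the 86 chromosomes behave as independent random draws from one population, which is a simplification for a panel drawn from 43 different subpopulations; the function name is illustrative and not part of the project's methods.

```python
def detection_probability(allele_freq: float, n_chromosomes: int) -> float:
    """Probability that an allele of the given frequency is seen at least once.

    Assumes independent sampling of chromosomes (a simplifying assumption).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    if n_chromosomes < 0:
        raise ValueError("number of chromosomes cannot be negative")
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    n_samples = 43                   # discovery panel size (from the text)
    n_chromosomes = 2 * n_samples    # diploid individuals: 86 chromosomes
    for maf in (0.01, 0.05, 0.10):
        p = detection_probability(maf, n_chromosomes)
        print(f"MAF {maf:.2f}: P(observed at least once) = {p:.3f}")
```

Under this assumption, an allele at a frequency of 0.05 is observed at least once in 86 chromosomes with a probability of about 0.99, whereas rarer alleles are missed more often.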
For the discovery of novel SNPs, bi-directional sequencing of
the 43 samples of the discovery panel was carried out. Selection
criteria were developed for prioritizing the SNPs for
validation, based on the data on novel and putatively functional
SNPs as well as the minor allele frequencies of the SNPs in the
discovery panel. Information on the frequencies of SNPs in
different databases such as dbSNP, Celera, RealSNP and HapMap,
along with information on haplotype block structures and tag
SNPs, was taken into consideration during selection of SNPs for
validation. Spacing between the selected SNPs within a gene,
though flexible, was also adjusted according to the size of the
gene so as to cover the entire gene uniformly. After this series
of filters, any remaining gaps were filled, if required, by SNPs
reported in the databases, chosen on validation criteria such as
multiple submissions.
In the first phase of the project, novel and reported SNPs from
75 genes are being validated on 1,871 samples collected from
different populations using the Sequenom MassARRAY system. The
validation process is being carried out in 2 steps: initial
confirmation of SNPs in population pools, followed by estimation
of frequency in the individual samples. These data would give us
insights for further identification of informative SNPs and
substructuring of the population, which would enable a judicious
collection of the samples.
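The second validation step, estimation of frequency in individual samples, amounts to counting alleles across genotypes. The following is a minimal sketch of that arithmetic for a biallelic SNP; the genotype encoding as two-letter strings and the function name are assumptions for illustration and do not reflect the output format of any genotyping platform.

```python
from collections import Counter


def minor_allele_frequency(genotypes: list[str]) -> float:
    """Estimate the MAF of a biallelic SNP from diploid genotypes.

    Each genotype is assumed to be a two-character string such as
    "AA", "AG" or "GG" (a hypothetical encoding).
    """
    for genotype in genotypes:
        if len(genotype) != 2:
            raise ValueError(f"expected a diploid genotype, got {genotype!r}")
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if not counts:
        raise ValueError("no genotypes supplied")
    if len(counts) > 2:
        raise ValueError("more than two alleles observed; SNP is not biallelic")
    if len(counts) == 1:
        return 0.0  # monomorphic in this sample set
    return min(counts.values()) / sum(counts.values())


if __name__ == "__main__":
    sample_genotypes = ["AA", "AG", "GG", "AA"]  # made-up example data
    print(minor_allele_frequency(sample_genotypes))
```

In the made-up example, allele G occurs on 3 of the 8 chromosomes, so the estimate is 3/8.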
In the second phase of the project, which involves genotyping a
large number of SNPs in the reference populations, the
Affymetrix and Illumina platforms are being used.
|
|
|
Data release policy of the Indian Genome Variation project |
It is envisaged that the Indian Genome Variation project would
eventually be useful for identifying predisposing haplotypes for
common and complex disorders, or common functional
polymorphisms, which might be useful for pharmacogenomic
studies. It would be a resource that catalogues the common
patterns of genetic variation in important candidate genes for
complex diseases. There is provision for widening the scope of
the project as more information on human genome variation
becomes available, including additional information on patterns
of linkage disequilibrium, and as cost-effective
high-throughput technologies are developed. Though nearly
11 million SNPs have been released in the public database and
the HapMap data is available, the selection of the appropriate
set of markers for identifying susceptibility haplotypes for
different complex genetic diseases is still debatable. Moreover,
these databases do not include Indian samples. In the Indian
Genome Variation consortium, there is also a provision for
parallel research to determine factors that could lead to the
generation of informative repeats and SNPs suitable for
designing case-control association studies. These inputs can be
incorporated during the development of the SNP database of the
Indian population.
Use of the portal will be free for all academic users around
the world. However, the discoveries arising out of the IGV
project will be IPR protected and will be licensed for
commercial exploitation.
|
|