
Pangenome

The pangenome is the entire gene set of all strains of a species. It includes genes present in all strains (core genome) and genes present only in some strains of a species (variable or accessory genome).


The core genome represents the genes present in all strains of a species. It typically includes housekeeping genes for cell envelope or regulatory functions.

The variable or accessory genome (also: flexible or dispensable genome) refers to genes not present in all strains of a species. These include genes present in two or more strains as well as genes unique to a single strain, for example genes for strain-specific adaptation such as antibiotic resistance.
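
These definitions can be illustrated with a minimal sketch. It assumes each strain has already been reduced to a set of gene-family names; the strain names, the gene sets, and the helper split_pangenome are made up for illustration and do not come from any particular tool. The pangenome is then the union of all sets, the core genome their intersection, and the accessory genome the difference between the two.

```python
from collections import Counter

# Hypothetical toy data: strain name -> gene families found in that strain
strains = {
    "strainA": {"dnaA", "gyrB", "rpoB", "blaTEM"},
    "strainB": {"dnaA", "gyrB", "rpoB", "stx2"},
    "strainC": {"dnaA", "gyrB", "rpoB", "blaTEM", "tetA"},
}

def split_pangenome(strains):
    """Return (pangenome, core, accessory, unique) as sets of gene families."""
    gene_sets = list(strains.values())
    pangenome = set().union(*gene_sets)      # genes seen in at least one strain
    core = set.intersection(*gene_sets)      # genes seen in every strain
    accessory = pangenome - core             # genes missing from at least one strain
    counts = Counter(g for genes in gene_sets for g in genes)
    unique = {g for g in accessory if counts[g] == 1}  # strain-specific genes
    return pangenome, core, accessory, unique

pangenome, core, accessory, unique = split_pangenome(strains)
print("core:     ", sorted(core))
print("accessory:", sorted(accessory))
print("unique:   ", sorted(unique))
```

In this toy example the three housekeeping-like genes form the core genome, while the resistance and toxin genes end up in the accessory genome.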

What are the advantages of extracting the pangenome?

  • characterizing strains by their individual gene sets (e.g., detecting virulence factors present in only one particular strain of a species)
  • developing vaccines against pathogenic strains
  • detecting, identifying and tracking new strains in metagenomic samples based on their individual gene subset of the species pangenome, see → PanPhlAn tool
  • studying the evolutionary impact of horizontal gene transfer
  • exploring strain diversity in environmental population genomics studies

Open or closed pangenome?

Some bacterial species are considered to have an unlimited gene repertoire (open pangenome), whereas other species seem to be limited to a maximum number of genes in their gene pool (closed pangenome).

Open versus closed pangenome

Open pangenome: the number of genes in the pangenome keeps increasing as additional strains are sequenced
Typically, species that live in multiple environments within mixed microbial communities have multiple ways of exchanging genetic material and hence continuously extend their total set of genes (open pangenome).
Example: Escherichia coli

Closed pangenome: after a certain number of sequenced strains, additional strains no longer contribute new genes to the species pangenome
A closed pangenome is typical for species that live in isolated niches with limited access to the global microbial gene pool. For those species, a small number of sequenced strains already covers the complete pangenome.
Example: Bacillus anthracis
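
The difference can be made concrete by counting how many previously unseen gene families each newly added strain contributes. The following is only a sketch on invented toy data (gene families are plain integers, and new_genes_per_strain is a hypothetical helper), not a measurement for the species named above. In an open pangenome the counts stay above zero; in a closed one they drop to zero after a few strains.

```python
def new_genes_per_strain(gene_sets):
    """For strains added in the given order, count gene families not seen before."""
    seen = set()
    new_counts = []
    for genes in gene_sets:
        new_counts.append(len(genes - seen))  # families contributed by this strain
        seen |= genes
    return new_counts

# Hypothetical toy data, not real measurements
open_like = [{1, 2, 3, 4}, {1, 2, 3, 5, 6}, {1, 2, 3, 7, 8}, {1, 2, 3, 9, 10}]
closed_like = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {1, 2, 3, 4}]

print(new_genes_per_strain(open_like))    # [4, 2, 2, 2]
print(new_genes_per_strain(closed_like))  # [4, 1, 0, 0]
```

With real data the result depends on the order in which strains are added, so the counts are usually averaged over many random orders.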


How to compare the gene content of different strains of a species?

1) Cluster all genes of all genomes by sequence similarity into gene families
2) Generate a gene-family presence/absence profile for each strain
        → Roary
3) Count how many gene families are covered by 1, 2, ... N strains (repeat with random strain orders)
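
Steps 2 and 3 can be sketched as follows, assuming step 1 has already been done by a clustering tool and each strain is given as a set of gene-family identifiers. The identifiers, the function names, and the parameters n_permutations and seed are illustrative assumptions, not part of any tool listed here.

```python
import random

def presence_absence(strains):
    """Step 2: one 0/1 profile per strain over the sorted list of all gene families."""
    families = sorted(set().union(*strains.values()))
    profiles = {
        name: [1 if fam in genes else 0 for fam in families]
        for name, genes in strains.items()
    }
    return families, profiles

def accumulation_curves(strains, n_permutations=100, seed=0):
    """Step 3: mean pangenome and core genome size after 1, 2, ... N strains,
    averaged over random strain orders."""
    rng = random.Random(seed)
    gene_sets = list(strains.values())
    n = len(gene_sets)
    pan_sums, core_sums = [0] * n, [0] * n
    for _ in range(n_permutations):
        order = gene_sets[:]
        rng.shuffle(order)
        pan, core = set(), set(order[0])
        for i, genes in enumerate(order):
            pan |= genes    # families seen in at least one strain so far
            core &= genes   # families seen in every strain so far
            pan_sums[i] += len(pan)
            core_sums[i] += len(core)
    pan_curve = [s / n_permutations for s in pan_sums]
    core_curve = [s / n_permutations for s in core_sums]
    return pan_curve, core_curve

# Hypothetical toy input: strain -> gene families from step 1
strains = {
    "s1": {"famA", "famB", "famC"},
    "s2": {"famA", "famB", "famD"},
    "s3": {"famA", "famC", "famE"},
}
families, profiles = presence_absence(strains)
pan_curve, core_curve = accumulation_curves(strains)
print(families, profiles["s1"])
print(pan_curve, core_curve)
```

Averaging over random orders removes the dependence on which strain happens to be added first; a pangenome curve that keeps rising suggests an open pangenome, while one that flattens suggests a closed pangenome.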

Pangenome tools

  • Roary - Fast tool for extracting complete pangenomes, core gene sets, or differences between reference genomes
  • panX - pangenome analysis and web-based visualization
  • PanOCT - considers both gene homology and conserved gene neighborhoods
  • OrthoMCL - ortholog clustering, e.g., for extracting core genomes
  • LS-BSR - rapid comparison of the genetic content of large numbers of genomes
  • PanPhlAn - pangenome-based detection of the gene composition of strains in environmental WGS samples