Sonifying COVID-19 genetic mutations over time
These uncertain times have brought into sharp focus the importance of scientific research and communication. Our understanding of the world is filtered through the filters of our conceptual and scientific knowledge, and the value of multi-modal representations of data is growing. In response to this we have set up the Covid-19 Listening Project dedicated to the sonic and musical representation of Covid-19 data.
In consultation with geneticist Gemma Bruno (Telethon Institute of Genetics and Medicine, Italy), programming and music technology resources are employed in the communication of relevant genetic patterns in the disease. Such data as the structure of the Covid-19 genome, the points of mutation within it, protein coding, the comparative genetics in the phylogenetic tree of Covid-19 samples and their geographical distribution are translated into pitch, rhythms and harmonies to create rich and compelling communicative works.
A 42-minute choral piece and a genomic-spatial realisation have already been produced using these techniques.
Background
The genome is formed by a long sequence of 4 types of nucleotides: adenine (A), guanine (G), cytosine (c), and uracil (U). This sequence is used to synthesise specific proteins such as the “Spike protein”, which gives COVID-19 its crown-like appearance. A mutation happens when a nucleotide is changed, inserted or deleted in the sequence.
The current sonification methodology associates notes’ timing to the position of the mutation within the DNA and pitch to the type of nucleotide mutation (e.g. adenine to guanine). This means that the position of the mutation results in different rhythmic placement, and the type of nucleotide mutation results in different melodies, giving each genome its own musical signature. Furthemore, repeating musical patterns mean that a mutation has persisted over time.
Note that even though the number of mutations may seem large, COVID-19 is actually considered a relatively stable virus.
Data gathering
The project uses the database of the National Center for Biotechnology Information (NCBI), which is being updated every day with new COVID-19 sequences coming from research centres across the world. The website is available here: NCBI Covid-19 page
The data is parsed and downloaded using the covid-genome-matlab-parser.
How the mutations are obtained from the NCBI dataset
The first step is to obtain the mutations are obtained from the NCBI dataset:
- The genomic comparison is run on the basis of the nucleotide sequences (in GACU symbols).
- In order to reduce the amount of data (currently over 700 genomes), only the first measurement of a day is considered; this currently reduces the number of genomes to 74.
- The comparison of genome pairs is carried out using the Needleman-Wunsch global alignment algorithm [1], which allows to identify the genetic mutations.
- The mutations are considered starting from the beginning of the first protein (NSP1), i.e. at nucleotide n. 266 of the NCBI accession MT019529. Mismatches in the last 100 bases are also ignored (for reference, the RNA of COVID-19 is about 30k bases long).
How the mutations are translated to music
The second step is to translate the mutations into music. Below are two examples of how to do this.
Youtube video
Below are the details of the procedure to generate the sound from the mutations:
- The nucleotide mutations are translated into notes using the table below; a constant value is added to all of them; the
-
sign indicates an insertion or a deletion; all other types of mutations are ignored.
Old basis | New basis | Midi note | Note |
---|---|---|---|
C | - | 47 | B2 |
U | - | 48 | C3 |
A | - | 49 | C♯3/D♭3 |
G | - | 50 | D3 |
G | A | 51 | D♯3/E♭3 |
G | U | 52 | E3 |
G | C | 53 | F3 |
A | G | 54 | F♯3/G♭3 |
A | U | 55 | G3 |
A | C | 56 | G♯3/A♭3 |
U | G | 57 | A3 |
U | A | 58 | A♯3/B♭3 |
U | C | 59 | B3 |
C | G | 60 | C4 |
C | A | 61 | C♯4/D♭4 |
C | U | 62 | D4 |
- | G | 63 | D♯4/E♭4 |
- | A | 64 | E4 |
- | U | 65 | F4 |
- | C | 66 | F♯4/G♭4 |
- The mutations are then organised in 8 groups according to the table below.
Protein name | Group | Instrument | Angle |
---|---|---|---|
NSP1 | 1 | Cello | -45° |
NSP2 | 1 | Cello | -45° |
NSP3 | 2 | Cello | -30° |
NSP4 | 3 | Cello | -15° |
NSP5 | 3 | Cello | -15° |
NSP6 | 3 | Cello | -15° |
NSP7 | 3 | Cello | -15° |
NSP8 | 3 | Cello | -15° |
NSP9 | 4 | Cello | 0° |
NSP10 | 4 | Cello | 0° |
NSP12 | 4 | Cello | 0° |
NSP13 | 5 | Cello | +15° |
NSP14 | 5 | Cello | +15° |
NSP15 | 5 | Cello | +15° |
NSP16 | 5 | Cello | +15° |
S | 6 | Double base | +30° |
ORF3a | 7 | Violin | +45° |
E | 7 | Violin | +45° |
M | 7 | Violin | +45° |
ORF6 | 7 | Violin | +45° |
ORF7a | 7 | Violin | +45° |
ORF8 | 7 | Violin | +45° |
N | 7 | Violin | +45° |
ORF10 | 7 | Violin | +45° |
Non-coding DNA | 8 | Violin | +45° |
-
In order to facilitate spatial discrimination, each group is rendered in a different direction, as indicated in the table above. This specific separation corresponds to the actual lenght of each protein. In other words, this results in a similar experience you’d have if the RNA was actually rolled out around you and each mutation would reproduce a sound from a corresponding direction. This means that not only one can identify the position of the mutation by using the time delay, but also the perceived position of the sound in space, with mutations closer to the beginning of the sequence appearing to your left, while later ones appearing to your right. The specific spatialisation technique used here is Perceptual Soundfield Reconstruction (PSR) [2,3,4].
-
In order to increase presence, each source is rendered with reverberation using Scattering Delay Networks (SDN) [2,5]; the simulated room is rectangular with size 10 m x 10 m x 3 m. The listener is approximately in the center of the space, and the sound source is 2 meters away from the listener.
-
In order to increase presence, each source is rendered with reverberation using Scattering Delay Networks (SDN) [2,5]; the simulated room is rectangular with size 10 m x 10 m x 3 m. The listener is approximately in the center of the space, and the sound source is 2 meters away from the listener.
-
The (informal) protein labels are taken from the NYT article in [7], which were inspired by the findings in [8].
Chorus of Changes (soundcloud)
Here, over 500 genome sequences are translated into two octaves of a B minor scale. The translations are selected by mapping the most common mutations types (‘note deltas’ above) into the most common diatonic scale degrees on a sample of Western Art Music (see Huron 2008)[6]. This results in familiar melodic motifs for the most commonly retained mutations and pandiatonic blurring for the more novel mutations. At the tempo selected this results in a surprisingly engaging piece of music lasting over 42 minutes, where the language of mutation is translated into the language motivic transformation, a deeper sonificaiton beyond arbitrary chormatic or ‘safe’ scale choices.This is performable by choir and organ nut its here rendered with MIDI instrumentations in Ableton Live with UAD and Native Instrument plugins.
Contributing
Please keep the coding convention.
We are particularly interested in collaborations with Molecular Biologists. Contact enzodesena AT gmail DOT com if interested.
Authors
- Enzo De Sena - desena.org (enzodesena AT gmail DOT com)
- Milton Mermikides - Wikipedia page
We are both with the Department of Music and Media, University of Surrey, Guildford, UK University of Surrey DMM Website.
Acknowledgements
We would like to thank Gemma Bruno (Telethon Institute of Genetics and Medicine, Italy), Niki Loverdu (KU Leuven, Belgium) and Nicoletta Bruno (Imperial College London, UK) for the useful discussions on topic. Thanks also to Carl Zimmer for pointing us to the Nature article [8]. Finally, we would like to thank our friends and colleagues for the useful feedback on the presentation.
The project also uses internally:
- Yauhen Yakimovich’s yamlmatlab Github repo
- Ken Schutte’s MIDI tooldbox Website
References
[1] Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis (Cambridge University Press).
[2] H. Hacıhabiboğlu, E. De Sena, Z. Cvetković, J.D. Johnston and J. O. Smith III, “Perceptual Spatial Audio Recording, Simulation, and Rendering,” IEEE Signal Processing Magazine vol. 34, no. 3, pp. 36-54, May 2017.
[3] E. De Sena, Z. Cvetković, H. Hacıhabiboğlu, M. Moonen, and T. van Waterschoot, “Localization Uncertainty in Time-Amplitude Stereophonic Reproduction,” IEEE/ACM Trans. Audio, Speech and Language Process. (in press).
[4] E. De Sena, H. Hacıhabiboğlu, and Z. Cvetković, “Analysis and Design of Multichannel Systems for Perceptual Sound Field Reconstruction,” IEEE Trans. on Audio, Speech and Language Process., vol. 21 , no. 8, pp 1653-1665, Aug. 2013.
[5] E. De Sena, H. Hacıhabiboğlu, Z. Cvetković, and J. O. Smith III “Efficient Synthesis of Room Acoustics via Scattering Delay Networks,” IEEE/ACM Trans. Audio, Speech and Language Process., vol. 23, no. 9, pp 1478 - 1492, Sept. 2015.
[6] Huron, D. (2008) Sweet Anticipation: Music and the Psychology of Expectation. MIT Press
[7] https://www.nytimes.com/interactive/2020/04/03/science/coronavirus-genome-bad-news-wrapped-in-protein.html (Accessed on: 8/3/2020)
[8] Gordon, D.E., Jang, G.M., Bouhaddou, M. et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature (2020)
License
This project is licensed under the GNU License. The code will be made available soon. In the meantime, contact enzodesena AT gmail DOT com if interested.
You use the code, data and findings here at your own risk. See the LICENSE.md file for details.