Project information
Deciphering the Language of DNA to Identify Regulatory Elements and Classify Transcripts Into Functional Classes (LanguageOfDNA)

Information

This project does not fall under the Faculty of Arts but under the Central European Institute of Technology. The official project page is on the muni.cz website.

Project code

896172

Project period

6/2020 - 9/2022

Investor / Programme framework / Project type

European Union

Horizon 2020
MSCA Marie Skłodowska-Curie Actions (Excellent Science)

MU faculty / unit

Central European Institute of Technology

Panagiotis Alexiou, PhD

"The Book of Life is written in a four-letter alphabet, A, G, C and T, with additional marks for DNA structure, methylation, site conservation, etc. In many respects, understanding DNA sequences is analogous to understanding natural languages. Machine learning methods such as recurrent deep neural networks have been applied successfully to both. Examples of use in genomics include the identification of protein binding sites, transcription factors, promoters and enhancers, and functional elements such as mRNA or lncRNA, and even metagenomic classification.

The last two years have revolutionized deep learning methods for natural language processing. Methods such as ELMO (???), BERT (March 2018, Google), ULMFit (January 2018, fast.ai) and LASER (January 2019, Facebook) now provide language models that cope with limited labeled data and with words that have several meanings, and that use an attention mechanism to focus on the relevant part of the sentence when interpreting a given word.

In genomic applications, we also often have a limited amount of training data (but the whole genome as an unlabeled corpus to learn from), the same DNA sequence can have different consequences depending on its context, and we need to know where to look for this context. The hope that the newest deep learning methods can be useful for genomic data is further strengthened by early experiments such as K. Heyer's Genomic ULMFit, which beats several state-of-the-art benchmarks: https://github.com/kheyer/Genomic-ULMFiT However, the number of applications of modern DL methods is still very limited.

The aim of this proposal is to change this. The primary goal is the identification of protein binding sites. While it is easy to find binding-site motifs, predicting whether a protein binds to a given DNA location is still not satisfactorily solved, because the problem cannot be simplified that far. Neural networks have previously brought qualitative improvements in exactly this kind of area."
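
The analogy between DNA and natural language drawn in the abstract is commonly made concrete by cutting a sequence into overlapping k-mers, which then play the role of words for a language model. The following is only a minimal sketch of that idea under stated assumptions, not the project's actual pipeline: the function names `kmer_tokenize` and `build_vocab`, the default `k=3` and the `stride` parameter are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers ("words") taken every `stride` bases.

    Illustrative helper only; k and stride are assumed values, not taken
    from the project description.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # The last k-mer starts at index len(seq) - k, hence the + 1 in range().
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct k-mer to an integer id, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


if __name__ == "__main__":
    tokens = kmer_tokenize("ACGTAC", k=3)
    vocab = build_vocab(tokens)
    ids = [vocab[t] for t in tokens]
    print(tokens)
    print(ids)
```

With a stride of 1 neighbouring tokens overlap in k - 1 bases; setting `stride` equal to `k` gives non-overlapping tokens instead. Which setting suits a given model is an empirical question that this sketch does not answer.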

Sustainable Development Goals

Masaryk University is committed to the UN Sustainable Development Goals, which aim to improve the conditions and quality of life on our planet by 2030.

Publications

Number of publications: 2

2023

Genomic benchmarks: a collection of datasets for genomic sequence classification

GREŠOVÁ Katarína, MARTINEK Vlastimil, ČECHÁK David, ŠIMEČEK Petr, ALEXIOU Panagiotis

Article in a scholarly journal

BMC Genomic Data, year: 2023, volume: 24, issue: 1

2020

PENGUINN: Precise Exploration of Nuclear G-Quadruplexes Using Interpretable Neural Networks

KLIMENTOVÁ Eva, POLÁČEK Jakub, ŠIMEČEK Petr, ALEXIOU Panagiotis

Article in a scholarly journal

Frontiers in Genetics, year: 2020, volume: 11, issue: OCT