Building Corpora of Technical Texts: Approaches and Tools

Authors

SOJKA Petr, LÍŠKA Martin, RŮŽIČKA Michal

Year of publication 2011
Type Article in Proceedings
Conference Fifth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit

Faculty of Informatics

Field Informatics
Keywords language of mathematics; mathematics of language; math representation; m-term; similarity; DML-CZ; EuDML
Description Building corpora of technical texts in the Science, Technology, Engineering, and Mathematics (STEM) domain has specific needs, especially in the handling of mathematical formulae. In particular, there is no widely accepted format for representing and handling math. We present an approach based on multiple representations of mathematical formulae that has been used for math retrieval, similarity, and clustering in a mathematical corpus. We provide an overview of our toolset, summarize our experiments to date, and propose further research directions and approaches.