Teachers: Jiri Rambousek, Jana Chamonikolasova
Students: Daniel Miksik, Dana Slancarova, Martin Kalivoda
The idea was to create a small parallel corpus that would make it possible to work with entire texts in translation analysis rather than short extracts. At the same time, the project aimed to gain experience that could be used in creating a larger parallel corpus of English and Czech in the future.
Although the main part of the work has been completed -- and the aims of the KACENKA grant met -- we keep improving and enlarging KACENKA gradually. It currently contains 3,297,283 words (of which 1,689,513 were acquired by scanning; see the exact calculation -- in Czech) and includes the following texts:
    Author        Title                       Translator       Format (see below)
____________________________________________________________________________
 1. Kipling,      The Jungle Book             -
 2. Rudyard       Kniha dzungli               Maixner          Full
 3.               Kniha dzungli               Skoumal
 4. Amis,         Lucky Jim                   -
 5. Kingsley      Stastný Jim                 Mucha            Full
 6. Lawrence,     Sons and Lovers             -
 7. D. H.         Synove a milenci            Wellek/Vancura   Full
 8.               Synove a milenci            Vancura/Novotna
 9. Dickens,      The Pickwick Papers         -
10. Charles       Pickwickovci                Tilschovi        Full
11. Dickens,      Oliver Twist                -
12. Charles       Oliver Twist                Tilschovi        Full
13. Hardy,        Jude the Obscure            -
14. Thomas        Neblahy Juda                Stankova         Full
15. Hardy,        Tess of the d'Urbervilles   -
16. Thomas        Tess z d'Urbervillu         Stankova         Full
17. Frost,        The List of Seven           -
18. Mark          Seznam sedmi                Rambousek        Full
19. Grahame,      The Wind in the Willows     -
20. Kenneth       Zabakova dobrodruzstvi      Grimmichova      2 texts + align (Word)
21. Fielding,     Tom Jones                   -
22. Henry         Tom Jones                   Kondrysova       orig. text in HTML, translation in Word
23. Asimov,       Reason (a short story)      -
24. Isaac         Rozum                       Cerny            3 texts + an align
25.               Dedukce                     Valina           of all three (Word)
26. Shakespeare,  Sonnets                     -
27. William       Sonety                      Macek            Both files for Word

The following two texts were offered to us from outside. We did not add the KACENKA header or change the filenames; we just included the texts as they had reached us.

28.               Everyman                                     Word for Windows
29.               Kdokoli (transl. by Pavel Drabek)
30. Orwell,       1984
    George                                                     (only aligned version)
31.               Czech and English versions of a stock-market report  Full
32. WHELP         English and Czech versions of a SW help file         text only

Most of the English texts for KACENKA were retrieved from Internet resources. The rest -- and nearly all the Czech texts -- had to be scanned using an OCR programme. For this purpose, ProLector 1.2 (by Improx) was used, with very good results.
The texts were then aligned (to match the corresponding paragraphs and sentences); this turned out to be the most laborious part of the whole process. Developing new software tools for this purpose was outside the scope of the project; however, it remains an unavoidable part of the future project.
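To illustrate what paragraph alignment involves, here is a minimal Python sketch. It is not the procedure used for KACENKA, where alignment was done without dedicated software; it is only an assumed helper that pairs paragraphs by position and flags pairs whose lengths differ so much that a manual check is probably needed. The function names, the blank-line paragraph convention, and the `max_ratio` threshold are all hypothetical choices made for this example.

```python
def split_paragraphs(text):
    """Split a text into paragraphs on blank lines, dropping empty ones."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def pair_paragraphs(source_text, target_text, max_ratio=2.0):
    """Pair paragraphs by position and flag pairs with very different lengths.

    Returns a list of (source, target, suspicious) tuples. A paragraph
    missing on either side is represented by an empty string.
    max_ratio is an assumed threshold, not a value from the project.
    """
    source = split_paragraphs(source_text)
    target = split_paragraphs(target_text)
    pairs = []
    for i in range(max(len(source), len(target))):
        src = source[i] if i < len(source) else ""
        tgt = target[i] if i < len(target) else ""
        shorter, longer = sorted((len(src), len(tgt)))
        # An empty side, or a large length mismatch, suggests a misalignment.
        suspicious = shorter == 0 or longer / shorter > max_ratio
        pairs.append((src, tgt, suspicious))
    return pairs
```

Calling `pair_paragraphs(english_text, czech_text)` on two plain-text files returns the positional pairs together with a flag for each; the flagged pairs would still have to be inspected and corrected by hand, which is where most of the effort described above goes.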