Department of English and American Studies, Faculty of Arts, Masaryk University
Arna Novaka 1, 660 88 Brno, Czech Republic

Parallel Corpus of English and Czech Texts

KACENKA

(Korpus anglicko-cesky - elektronicky nastroj Katedry anglistiky)

was created by the Department of English, Faculty of Arts, Masaryk University, in 1997 to support research and teaching in the field of translation. It was financed by the FR VS (Development Fund for Universities in the Czech Republic). The following people participated in the project:

Teachers: Jiri Rambousek, Jana Chamonikolasova
Students: Daniel Miksik, Dana Slancarova, Martin Kalivoda

The idea was to create a small parallel corpus that would make it possible to work with entire texts in translation analysis rather than short extracts. At the same time, the project aimed to gain experience that could be used in creating a larger parallel corpus of English and Czech in the future.
Although the main part of the work has been completed -- and the aims of the KACENKA grant met -- we keep improving and enlarging KACENKA gradually. Currently, it contains 3,297,283 words (of which 1,689,513 were acquired by scanning; see the exact calculation -- in Czech) and includes the following texts:

Literary Texts

    Author      Title                     Translator      Format (see below)
____________________________________________________________________________

 1. Kipling,    The Jungle Book           -               
 2.    Rudyard  Kniha dzungli             Maixner         Full
 3.             Kniha dzungli             Skoumal

 4. Amis,       Lucky Jim                 -               
 5.   Kingsley  Stastny Jim               Mucha           Full

 6. Lawrence,   Sons and Lovers           -               
 7.    D. H.    Synove a milenci          Wellek/Vancura  Full
 8.             Synove a milenci          Vancura/Novotna

 9. Dickens,    The Pickwick Papers       -               
10.    Charles  Pickwickovci              Tilschovi       Full

11. Dickens,    Oliver Twist              -               
12.    Charles  Oliver Twist              Tilschovi       Full

13. Hardy,      Jude the Obscure          -               
14.    Thomas   Neblahy Juda              Stankova        Full

15. Hardy,      Tess of the d'Urbervilles -               
16.    Thomas   Tess z d'Urbervillu       Stankova        Full

17. Frost,      The List of Seven         -               
18.    Mark     Seznam sedmi              Rambousek       Full

19. Grahame,    The Wind in the Willows   -               
20.    Kenneth  Zabakova dobrodruzstvi    Grimmichova     2 texts + align (Word)

21. Fielding,   Tom Jones                 -               
22.    Henry    Tom Jones                 Kondrysova      orig. text in HTML, 
                                                          translation in Word

23. Asimov,     Reason (a short story)    -               
24.    Isaac    Rozum                     Cerny           3 texts + an align
25.             Dedukce                   Valina          of all three (Word)

26. Shakespeare Sonnets                   -               
27.    William  Sonety                    Macek           Both files for Word


The following two texts were offered to us from outside. We did not add the KACENKA header or change the filenames; we simply included the texts as they reached us.

28. Everyman                                              Word for Windows                                          
29. Kdokoli    (transl. by Pavel Drabek)

30. Orwell,    1984                                       
      George   (only aligned version)

Non-literary Texts

31. Czech and English versions of a 
    stock-market report                                    Full

32. WHELP      English and Czech versions of a             
               SW help file                                text only

Most of the English texts for KACENKA were retrieved from Internet resources. The rest -- and nearly all the Czech texts -- had to be scanned using an OCR programme. For this purpose, ProLector 1.2 (by Improx) was used, with very good results.

The texts were then aligned so that corresponding paragraphs and sentences match; this turned out to be the most laborious part of the whole process. Developing new software for this purpose was outside the scope of the project; however, it remains a necessary part of the planned future project.
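
To make the paragraph-level step concrete, here is a minimal Python sketch. It is an illustration only, not the procedure used for KACENKA, whose alignment tools are not described here. It assumes that paragraphs are separated by blank lines and that original and translation mostly keep the same paragraph order; the function names and the max_ratio threshold are hypothetical. It pairs paragraphs by position and flags pairs whose lengths differ so much that a human should check them.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines, normalizing whitespace."""
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    return [" ".join(block.split()) for block in blocks if block.strip()]


def pair_paragraphs(source_text, target_text, max_ratio=2.0):
    """Pair paragraphs by position and flag pairs that look misaligned.

    A pair is flagged when one side is missing or when the longer
    paragraph has more than max_ratio times the characters of the
    shorter one (an assumed, arbitrary threshold).
    """
    src = split_paragraphs(source_text)
    tgt = split_paragraphs(target_text)
    pairs = []
    for i in range(max(len(src), len(tgt))):
        s = src[i] if i < len(src) else ""
        t = tgt[i] if i < len(tgt) else ""
        shorter, longer = sorted((len(s), len(t)))
        suspicious = shorter == 0 or longer / shorter > max_ratio
        pairs.append((s, t, suspicious))
    return pairs


if __name__ == "__main__":
    english = "First paragraph.\n\nSecond one is here."
    czech = "Prvni odstavec.\n\nDruhy je tady.\n\nNavic."
    for s, t, flag in pair_paragraphs(english, czech):
        print("CHECK" if flag else "ok   ", "|", s, "|", t)
```

A positional pairing like this only narrows down where manual correction is needed; a single merged or split paragraph shifts every later pair, which is one reason alignment is so laborious without dedicated software.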

Format

Because KACENKA was meant to provide data that can be used without special computer knowledge or the installation of specialized corpus managers, the form in which the texts are stored is fairly straightforward. KACENKA offers most of the texts in

The texts marked "Full" in the lists above are available in all these formats.
KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions.