TEI-CORPO

Conversion tool from Elan, Clan, Transcriber and Praat files to TEI files and back

Java library and Swing user interface

Conversions can be made online at this address, without using the command line interface: https://ct3.ortolang.fr/teiconvert/

The Java conversion tool (formats TEI_CORPO, CLAN, ELAN, Transcriber, Praat) can be downloaded here: teicorpo.jar

(Note: The filename teicorpo.jar has changed since version 1.40. Previous name was conversion.jar)

Warning: Java (version 8 or later) must be installed on your computer before you can run the commands: Download Java

The source code can be found here: https://github.com/christopheparisse/teicorpo (the GitHub repository contains only the project sources, not the compiled jar file).

Using the command line conversion tool

The tool can be used from the command line. The jar file contains several subprograms. The main commands are grouped together in a general command called TeiCorpo. Other, more specific commands are useful for part-of-speech tagging or for editing TEI files. The same general set of parameters applies to all commands, although some parameters are command-specific. The general command has the following form:

java -cp teicorpo.jar fr.ortolang.teicorpo.TeiCorpo -from input-format -to output-format input_files ... -o output [parameters]

All commands use the same input and output parameters:

name of the file or directory containing the files to be converted (may be preceded by -i for clarity)
-o name of the output file or name of the output directory
-from input format (if -from is omitted, the input format is deduced from the file extension)
-to output format (if -to is omitted, the output format is deduced from the file extension)

The number of input files to convert is not limited, but only one output parameter can be set. If -o is not set, or if there is more than one input file, the name of each output file is derived from the name of its input file. If no output directory is specified, the output files are written to the same directory as the input files. The input and output parameters can be directory names. If the input parameter is a directory, all files in its subtree are converted and placed at the corresponding location in the output tree.

The -from and -to options take precedence over the information provided by file extensions. Both options accept the following arguments (each one corresponds to the default format used by the tool of the same name):
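
As an illustration of the input and output rules above, here is a minimal sketch of a preview wrapper around the general command. The function name teicorpo, the variables DRY_RUN and TEICORPO_JAR, and the paths interview.cha, interview.eaf, corpus/ and converted/ are assumptions made for this example; only the java command line and the -from, -to and -o options come from this document.

```shell
#!/bin/sh
# Preview wrapper around the general TeiCorpo command.
# ASSUMPTIONS: teicorpo(), DRY_RUN and TEICORPO_JAR are illustrative names,
# and all file and directory names below are placeholders.
TEICORPO_JAR="${TEICORPO_JAR:-teicorpo.jar}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

teicorpo() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "java -cp $TEICORPO_JAR fr.ortolang.teicorpo.TeiCorpo $*"
    else
        java -cp "$TEICORPO_JAR" fr.ortolang.teicorpo.TeiCorpo "$@"
    fi
}

# One input file with an explicit output file name.
teicorpo -from clan -to elan interview.cha -o interview.eaf

# A whole directory tree: here -o names the output directory.
teicorpo -from clan -to elan corpus/ -o converted/
```

Running the script as is only prints the two commands; setting DRY_RUN=0 in the environment would execute them, provided Java and teicorpo.jar are available.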

clan
elan
praat
transcriber
text (raw text files)
srt (subtitle files)

The -to option can also take the following arguments:

txm (files for TXM)
lexico (lexico or le trameur)
text (UTF8 text)
srt (subtitles)

Other parameters that apply to all commands:

-n level: nesting level (1 for main lines)
-a name : speaker or field produced as output (wildcard characters can be used)
-s name : speaker or field removed as output (wildcard characters can be used)
-rawline : produce files without codes specific to spoken language transcription
-normalize format : produce files from sources of "format" - format can be "clan"
-target format : produce files towards "format" - format can be "praat"
-m name/path for the media file (useful for Praat, Text and Srt files)

Other parameters for exports towards Txm and Lexico

-tv "type:value" : a field type:value is added to the <w> of Txm or Lexico or Le Trameur
-section : add section marker at the end of an utterance (for Lexico/Le Trameur)
-sandhi : information specific for the study of liaisons

Other parameters for exports towards text

-raw transcriptions are produced without speaker names or any other information
-iramuteq text files include supplementary information for Iramuteq (headers)
-concat all output files are concatenated into a single output file
-append files are appended to the original output file

Parameter for import from text

-normalize noparticipant : the first word of a paragraph is not considered as the name of the speaker

Conversion from Praat can use some specific parameters

-p parameters_file: contains parameters with the format below, one parameter per line.
-m name/path for the media file
-e encoding (by default, the encoding is detected automatically)
-d default UTF8 encoding
-t tiername type parent (description of the relationship between tiers)
possible types: assoc, incl, symbdiv, timediv (same as ELAN linguistic types: assoc = Symbolic Association, incl = Included In, symbdiv = Symbolic Subdivision, timediv = Time Subdivision)
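
The parameter file described above could look like the sketch below. The tier names (words, gloss, syll), the media file name and the output file name are placeholders, and writing each line with its option name in front is an assumption drawn from "the format below"; it should be checked against the tool before use.

```shell
#!/bin/sh
# Write a hypothetical Praat parameter file, one parameter per line.
# ASSUMPTIONS: tier names, media name and file names are placeholders;
# the exact line syntax expected by -p is inferred, not confirmed.
write_praat_params() {
    cat > "$1" <<'EOF'
-m interview.wav
-t gloss assoc words
-t syll timediv words
EOF
}

write_praat_params praat_params.txt

# Print the command that would use this file (not executed here).
echo "java -cp teicorpo.jar fr.ortolang.teicorpo.PraatToTei interview.TextGrid -p praat_params.txt"
```

Each -t line follows the order tiername, type, parent: here gloss is declared as a Symbolic Association of words, and syll as a Time Subdivision of words.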

Other commands (all of them are also part of the TeiCorpo command):

java -cp teicorpo.jar fr.ortolang.teicorpo.ClanToTei [parameters] (process Chat files, Srt files and Text files)
java -cp teicorpo.jar fr.ortolang.teicorpo.TranscriberToTei [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.PraatToTei [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.ElanToTei [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToClan [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToTranscriber [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToElan [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToPraat [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToTxm [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToLexico [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToSrt [parameters]

Other commands to edit TEI files automatically

java -cp teicorpo.jar fr.ortolang.teicorpo.TeiEdit [parameters]

This command modifies the values of the fields media, mediamime and docname, and the temporal values in the timeline. To do this, use the option -c command=value:
- -c media=filename
- -c mediamime=value
- -c docname=filename (internal name of the document, used for XML queries)
- -c chgtime=value (shift all temporal information by value)
- -c replace (do not create a new file but replace the old one)
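
The sketch below shows how several -c options might be assembled into one TeiEdit call. The helper tei_edit_cmd and every file name and value are assumptions made for illustration; the unit expected by chgtime is not stated in this document, so 1.5 is only a placeholder. Combining -c options in a single call is inferred from the existence of -c replace, which only makes sense alongside another change.

```shell
#!/bin/sh
# Build (and print) a TeiEdit command line from a list of command=value pairs.
# ASSUMPTIONS: tei_edit_cmd is a hypothetical helper; file names and values
# are placeholders; the unit of chgtime is not specified in the README.
tei_edit_cmd() {
    file=$1
    shift
    cmd="java -cp teicorpo.jar fr.ortolang.teicorpo.TeiEdit $file"
    # Each remaining argument becomes its own -c option.
    for pair in "$@"; do
        cmd="$cmd -c $pair"
    done
    printf '%s\n' "$cmd"
}

# Point the transcript at a new media file and set its MIME type.
tei_edit_cmd session.tei_corpo.xml media=session.wav mediamime=audio/wav

# Shift the timeline and overwrite the original file.
tei_edit_cmd session.tei_corpo.xml chgtime=1.5 replace
```

The helper only prints the command, so the result can be inspected before it is pasted into a terminal.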

Using TreeTagger to tag the parts of speech of a TEI file

Put the TreeTagger software (available for download here: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) and the parameter file (the model, for example english-utf8.par or spoken-french.par) in the current directory,
or use the environment variable TREE_TAGGER to give the location of the TreeTagger files.
COMMAND TO USE: java -cp teicorpo.jar fr.ortolang.teicorpo.TeiTreeTagger -syntaxformat conll|ref|w -model name_of_model
The parameter -program gives the path of the TreeTagger program on your computer directly.
The parameter -model gives the path of the TreeTagger language model on your computer directly.
FOR EXAMPLE: java -cp teicorpo.jar fr.ortolang.teicorpo.TeiTreeTagger -program "tree-tagger.exe" -model english-bnc.par myfile.tei_corpo.xml -syntaxformat conll
The -syntaxformat conll parameter allows an efficient export to ELAN or PRAAT by creating a CONLL structure that does not modify the original main line.
The optional parameter -rawline cleans specific codes out of the orthographic line (as far as possible).
Usage example with the TreeTagger module and the PERCEO model installed in the directory /projets/syntax. The shell script analyse.sh contains:

TREE_TAGGER=/projets/syntax
export TREE_TAGGER
java -cp /projets/plceforlibraries/teicorpo.jar fr.ortolang.teicorpo.TeiTreeTagger -syntaxformat conll -model perceo_oral/spoken-french.par -rawline $1

The script is run with the command sh analyse.sh filename
The result is written to a file with the extension .tei_corpo_ttg.tei_corpo.xml
This result file can easily be converted to the ELAN or PRAAT format:
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiCorpo -to elan filename.tei_corpo_ttg.tei_corpo.xml
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiCorpo -to praat filename.tei_corpo_ttg.tei_corpo.xml
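
To chain the tagging and conversion steps over a directory, something like the following sketch could be used. It assumes that the tagger writes its result next to the input and replaces the .tei_corpo.xml suffix with .tei_corpo_ttg.tei_corpo.xml, which is inferred from the file names above rather than stated. The functions tagged_name and plan_pipeline are hypothetical helpers, and the script only prints the commands.

```shell
#!/bin/sh
# ASSUMPTION: name.tei_corpo.xml is tagged into name.tei_corpo_ttg.tei_corpo.xml
# in the same directory (inferred from the example file names).
tagged_name() {
    printf '%s\n' "${1%.tei_corpo.xml}.tei_corpo_ttg.tei_corpo.xml"
}

# Print the tagging and ELAN conversion commands for every TEI file in a directory.
plan_pipeline() {
    for f in "$1"/*.tei_corpo.xml; do
        [ -e "$f" ] || continue                                  # no match: skip the literal pattern
        case $f in *.tei_corpo_ttg.tei_corpo.xml) continue ;; esac   # already tagged
        echo "sh analyse.sh $f"
        echo "java -cp teicorpo.jar fr.ortolang.teicorpo.TeiCorpo -to elan $(tagged_name "$f")"
    done
}

plan_pipeline "${1:-.}"
```

The printed lines can be reviewed and then piped to sh once the paths look right.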

Stanford Natural Language Processing (SNLP)

The Stanford parser, part-of-speech tagger and other tools can be called to process the content of the TEI file. As with the TreeTagger program, the results come in three formats.

First download the SNLP jar files, including the language models that you want to use, from this page: https://stanfordnlp.github.io/CoreNLP/index.html#download
COMMAND TO USE: java -cp "teicorpo.jar:directory_for_SNLP/*" fr.ortolang.teicorpo.TeiSNLP -syntaxformat conll|dep|ref|w -model english|french|filename.properties filename.tei_corpo.xml
The jar files for CoreNLP and the language model jars can be found here: https://stanfordnlp.github.io/CoreNLP/
Valid properties files can be found here: https://ct3.ortolang.fr/tei-corpo/properties/
The conll and dep parameters provide the result in CoNLL format (columns x lines). The pos parameter provides the result in "ref" format (see TreeTagger).
As for TreeTagger, the conll format can be converted to Praat and Elan, and the ref format to TXM and Lexico.
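
The command above uses ":" between classpath entries, which is the separator Java expects on Unix-like systems; on Windows it is ";". The sketch below picks the separator from the platform and prints the resulting TeiSNLP command. The helpers classpath_sep and snlp_cmd and the variable SNLP_DIR are assumptions made for this example.

```shell
#!/bin/sh
# ASSUMPTIONS: SNLP_DIR, classpath_sep and snlp_cmd are illustrative names;
# the input file name is a placeholder.
SNLP_DIR="${SNLP_DIR:-directory_for_SNLP}"

# Java separates classpath entries with ':' on Unix-like systems and ';' on Windows.
classpath_sep() {
    case "$(uname -s)" in
        CYGWIN*|MINGW*|MSYS*) echo ';' ;;
        *) echo ':' ;;
    esac
}

# $1 = syntax format (conll, dep, ref or w), $2 = model, $3 = TEI file
snlp_cmd() {
    printf '%s\n' "java -cp \"teicorpo.jar$(classpath_sep)$SNLP_DIR/*\" fr.ortolang.teicorpo.TeiSNLP -syntaxformat $1 -model $2 $3"
}

snlp_cmd conll english myfile.tei_corpo.xml
```

The classpath is kept inside double quotes in the printed command so that the shell does not expand the "*" wildcard itself; Java expands it to all jar files in the directory.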

Version history

1.00 Initial fully functional version
1.40 Rename jar file. English version.
1.40.2 Corrections of bugs for CLAN import/export. English version improved.
1.40.3 Bug correction
1.40.4 Inclusion of Stanford Natural Language Processing program
1.40.5 Add text and srt input format
1.40.6 Bug correction (conversion to praat and elan) and modification of argument name in <ref> syntactic information

TEI-CORPO

Conversion tool from Elan, Clan, Transcriber and Praat to TEI and back

Java library and Swing user interface

Conversions can be made online at this address, without going through the command line interface: https://ct3.ortolang.fr/teiconvert/

The Java format conversion tool (TEI_CORPO, CLAN, ELAN, Transcriber, Praat) can be downloaded here: teicorpo.jar

Warning: Java must be installed on your computer to run the commands: Download Java

The source code of the program is available at https://github.com/christopheparisse/teicorpo (the GitHub site contains only the source files of the project).

Using the command line conversion tool

The tool can be used from the command line. Several commands can be run. The main commands are grouped together in a general command, TeiCorpo. The additional parameters have the same form for all commands, but some parameters apply only to certain commands.

java -cp teicorpo.jar fr.ortolang.teicorpo.TeiCorpo -from input-format -to output-format input_files -o output [parameters]

All commands use the same input and output parameters:

name of the file or directory containing the files to be converted (may be preceded by -i)
-o name of the output file, or directory for the result files
-from input format (if -from is omitted, the input format is deduced from the file extension)
-to output format (if -to is omitted, the output format is deduced from the file extension)

The number of items to convert is not limited. However, only one output parameter can be given with -o. If the -o option is not specified, or if there is more than one input file, the output file has the same name as the input file, with a different extension, and is stored in the same place. The input and output parameters can be directory names. On input, all files in the tree that match the format given by the -from option (or all files of a known type if there is no -from option) are converted. On output, a directory name is used as the location for the files produced.

The -from and -to options take precedence over the information given by file extensions. The -from and -to options accept the following values:

clan
elan
praat
transcriber

The -to option also accepts the following values:

txm (files for TXM)
lexico (lexico or le trameur)
text (UTF8 text)
srt (subtitles)

Additional parameters that apply to all commands

-n level: nesting level (1 for main lines)
-a name : the speaker/field name is produced as output (wildcard characters accepted)
-s name : the speaker/field name is removed from the output (wildcard characters accepted)
-rawline : produce files without codes specific to spoken language transcription
-normalize format : produce files from sources of "format" - format can be "clan" (other sources to come)
-target format : produce files towards "format" - format can be "praat" (other targets to come)

Additional parameters for exports to Txm and Lexico

-tv "type:value" : a field type:value is added to the <w> elements of txm or lexico or le trameur
-section : add a section marker at the end of each utterance (for lexico/le trameur)
-sandhi : include information specific to the analysis of liaisons

Additional parameters for exports to text

-raw only the transcriptions are produced, without speaker or other information
-iramuteq text files are completed with the headers needed by Iramuteq
-concat all output files are added to a single output file
-append files are added to the original file

Conversion from Praat has additional parameters

-p parameters_file: contains the parameters in the format below, one set of parameters per line.
-m name/path of the media file
-e encoding (by default, the encoding is detected automatically)
-d default UTF8 encoding
-t tiername type parent (description of the relationships between tiers)
allowed types: assoc, incl, symbdiv, timediv (same as ELAN linguistic types: assoc = Symbolic Association, incl = Included In, symbdiv = Symbolic Subdivision, timediv = Time Subdivision)

Additional commands (part of TeiCorpo):

java -cp teicorpo.jar fr.ortolang.teicorpo.ClanToTei [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TranscriberToTei [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.PraatToTei [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.ElanToTei [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToClan [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToTranscriber [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToElan [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToPraat [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToTxm [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToLexico [parameters]
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiToSrt [parameters]

Additional command to edit TEI files automatically

java -cp teicorpo.jar fr.ortolang.teicorpo.TeiEdit [parameters]

This command modifies the values of the fields media, mediamime and docname, and the temporal values in the timeline. To do this, use the option -c command=value:
- -c media=filename
- -c mediamime=value
- -c docname=filename (internal name of the document, used for XML queries)
- -c chgtime=value (shifts all time markers by 'value')
- -c replace (does not create a new file but replaces the old one)

Using TreeTagger to tag the parts of speech of a TEI file

Put the tree-tagger program (available for download here: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) in the current directory, together with the parameter file (the model, for example english-utf8.par or spoken-french.par),
or use the environment variable TREE_TAGGER to give the location of the TreeTagger files.
COMMAND TO USE: java -cp teicorpo.jar fr.ortolang.teicorpo.TeiTreeTagger -syntaxformat conll|ref|w -model name_of_model
The -syntaxformat conll parameter allows an efficient export to ELAN or PRAAT by creating a CONLL-type structure that does not modify the original orthographic line in any way.
The optional parameter -rawline cleans the coding characters out of the original orthographic line for the analysis (where possible).
Usage example with the TreeTagger commands and a PERCEO model installed in the directory /projets/syntax of the computer. The script analyse.sh contains:

TREE_TAGGER=/projets/syntax
export TREE_TAGGER
java -cp /projets/emplacementlibraries/teicorpo.jar fr.ortolang.teicorpo.TeiTreeTagger -syntaxformat conll -model perceo_oral/spoken-french.par -rawline $1

The script is run with the command sh analyse.sh filename
The result is written to a file with the extension .tei_corpo_ttg.tei_corpo.xml
This file can be converted directly to the ELAN or PRAAT format:
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiCorpo -to elan filename.tei_corpo_ttg.tei_corpo.xml
java -cp teicorpo.jar fr.ortolang.teicorpo.TeiCorpo -to praat filename.tei_corpo_ttg.tei_corpo.xml

Versions

1.00 Initial complete version
1.40 Jar file renamed. English version by default on the command line.
1.40.2 Bug fixes for CLAN import/export. English version improved.
1.40.3 Bug fixes
1.40.4 Use of Stanford Natural Language Processing
1.40.5 Import of text and srt (subtitles)
1.40.6 Bug fixes (conversion to praat and elan) and change of attribute name in <spanGrp><ref>