Shared tooling for the UD Turkic group: cross-language clustering, annotation strategy tables, treebank discovery, and general UD utilities.
Cross-language lemma clustering across all Turkic UD treebanks. Maps overarching concepts (e.g., "BOL" for copula) to language-specific lemmas and generates UPOS × deprel distributions. See docs/ud-query.md for the query approach.
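The concept-to-lemma mapping and the UPOS × deprel tally can be illustrated with a minimal sketch. This is not the code in turkic_clustering.py: the mapping layout (concept, then language code, then lemma list), the function name `upos_deprel_distribution`, and the sample sentence with its annotation are assumptions made for this example.

```python
from collections import Counter

# Hypothetical mapping layout: concept -> language code -> lemmas.
# The real complete_lemma_mapping.json may be structured differently.
LEMMA_MAPPING = {"BOL": {"tr": ["ol"], "kk": ["бол"]}}

# Hand-written CoNLL-U sample (10 tab-separated columns per token).
# The annotation is illustrative only, not taken from a treebank.
SAMPLE_TR = "\n".join([
    "# sent_id = demo-1",
    "# text = Hasta oldu.",
    "1\tHasta\thasta\tADJ\t_\t_\t0\troot\t_\t_",
    "2\toldu\tol\tAUX\t_\t_\t1\tcop\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
])


def upos_deprel_distribution(conllu_text, lemmas):
    """Count (UPOS, deprel) pairs for tokens whose lemma is in `lemmas`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if cols[2] in lemmas:  # column 3 is LEMMA
            counts[(cols[3], cols[7])] += 1  # columns 4 and 8: UPOS, DEPREL
    return counts


dist = upos_deprel_distribution(SAMPLE_TR, set(LEMMA_MAPPING["BOL"]["tr"]))
print(dist)
```

Running the same function over each language's treebank with that language's lemma list would give one distribution per language for the shared concept.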
# List all Turkic languages and treebanks
python clustering/turkic_clustering.py --list-languages
# Process all Turkic languages with lemma mapping
python clustering/turkic_clustering.py --lemma-mapping clustering/data/complete_lemma_mapping.json --output results.json
# Generate annotation strategy tables from results
python clustering/generate_annotation_tables.py results.json
# Discover all UD Turkic repos via GitHub CLI
python clustering/get_ud_repos_with_gh.py --output ud_repos.json

Data:
- clustering/data/complete_lemma_mapping.json: lemma mappings across Turkic languages
- clustering/data/ud_languages.json: language-to-treebank mapping
- clustering/data/lemma_mapping.csv: tabular lemma mapping
General UD utilities (not Turkic-specific), moved from ud-turkic/parallel.
- compare_treebanks.py: compare annotations between two CoNLL-U files with the same sentence IDs
- count_tokens.py: token/POS/feature statistics for CoNLL-U files
- fix_spaceafters.py: fill in missing SpaceAfter=No from UD validator error logs
- generate_treebank_stats.py: generate statistics tables (LaTeX/Markdown/JSON) for all UD Turkic treebanks
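As a rough illustration of comparing two CoNLL-U files that share sentence IDs, the sketch below aligns sentences by `sent_id` and reports tokens whose annotation differs. It is not the implementation of compare_treebanks.py: the function names `read_sentences` and `diff_annotations`, the choice of compared columns, and the sample data are assumptions for the example.

```python
def read_sentences(conllu_text):
    """Map sent_id -> list of (ID, FORM, UPOS, HEAD, DEPREL) tuples."""
    sentences, sent_id, tokens = {}, None, []
    # The trailing "" flushes the last sentence if the text lacks a final blank line.
    for line in conllu_text.splitlines() + [""]:
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
        elif line.startswith("#"):
            continue  # other comment lines are ignored
        elif not line.strip():
            if sent_id is not None and tokens:
                sentences[sent_id] = tokens
            sent_id, tokens = None, []
        else:
            cols = line.split("\t")
            # Keep plain word lines only (no multiword ranges or empty nodes).
            if len(cols) == 10 and cols[0].isdigit():
                tokens.append((cols[0], cols[1], cols[3], cols[6], cols[7]))
    return sentences


def diff_annotations(text_a, text_b):
    """Return (sent_id, token_a, token_b) for every differing token pair."""
    a, b = read_sentences(text_a), read_sentences(text_b)
    diffs = []
    for sid in sorted(a.keys() & b.keys()):  # only sentences present in both
        # zip stops at the shorter sentence; tokenization mismatches are not reported here.
        for tok_a, tok_b in zip(a[sid], b[sid]):
            if tok_a != tok_b:
                diffs.append((sid, tok_a, tok_b))
    return diffs


# Hand-written sample: the two versions disagree on the deprel of token 1.
FILE_A = (
    "# sent_id = s1\n"
    "1\tKitap\tkitap\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tgeldi\tgel\tVERB\t_\t_\t0\troot\t_\t_\n"
)
FILE_B = FILE_A.replace("nsubj", "obj")

diffs = diff_annotations(FILE_A, FILE_B)
for sid, tok_a, tok_b in diffs:
    print(sid, tok_a, "!=", tok_b)
```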
python ud/compare_treebanks.py treebank1.conllu treebank2.conllu
python ud/count_tokens.py corpus.conllu
python ud/fix_spaceafters.py error_log.txt treebank.conllu
# Generate treebank stats table from pre-cloned repos
python ud/generate_treebank_stats.py --local-dir /tmp/ud-impact --format both
# LaTeX only, with UD version in caption
python ud/generate_treebank_stats.py --local-dir /tmp/ud-impact --format latex --ud-version 2.15 --output stats.tex
# JSON output for downstream processing
python ud/generate_treebank_stats.py --format json --output stats.json

- ud-tools: general UD tooling (udvalidate, udsearch, udeval)
- ud-turkic/parallel: parallel treebank tools