A data scientist and computational linguist with a PhD in Information and Communication Technologies (focus on text classification, machine learning, large language models, and AI). Researching, designing, and deploying machine learning, AI, and LLM-based solutions as part of a team driving digital transformation in a global manufacturing environment at International Flavors & Fragrances, Inc. Developing language resources and technologies at the Department for Knowledge Technologies at the Jožef Stefan Institute. Especially active in text categorization tasks, web corpora collection, machine translation, and large language model evaluation.



To learn more about my past projects, you can listen to the TPC Klub interview with me (in Slovenian) or the DIHUR podcast (in English):
https://youtu.be/usIgGW41xP0?si=_b-c0vP80N_40MwI
https://youtu.be/fdzA2uE1klQ?feature=shared
➡️ LinkedIn: https://www.linkedin.com/in/taja-kuzman/
➡️ Google Scholar: https://scholar.google.com/citations?user=2WhxNBcAAAAJ
➡️ GitHub: https://github.com/TajaKuzman
➡️ Medium: https://medium.com/@taja.kuzman
🌟 Slovene Society for Language Technologies (SDJT)
🌟 Slovenian Artificial Intelligence Society (SLAIS)
🌟 SIGWAC: ACL special interest group on web as corpus
🌟 Co-leading the CLASSLA knowledge centre for South Slavic languages and the LLMs4SSH centre for Large Language Models for Social Sciences and Humanities
🌟 Organising the CLASSLA-Express workshop on using CLARIN.SI corpora in language research
🌟 Evaluating LLMs for Slovenian and South Slavic languages in LLM4DH (2024-2027) and LLMs4EU (2025-2028) projects
🌟 Enriching parliamentary datasets with topic labels based on the Comparative Agendas Project (CAP) schema in the ParlaCAP project (2024-2027)
🌟 Developing a news topic categorization model for the **EMMA project** (2023-2026)
🌟 Developing and evaluating retrieval-augmented generation (RAG) solutions for the **PandaChat** project by the PC7 company (2023-2025)
🌟 Web corpora creation and curation: **MaCoCu project** (2021-2023)
🌟 Machine Translation of massive parliamentary corpora: **ParlaMint II project** (2021-2023)
🔍 Conferences: COLING 2022, TSD 2022, LREC-COLING 2024, COLING 2025
🔍 Journals: PeerJ Computer Science journal, Natural Language Processing journal, Research Methods in Applied Linguistics, Mathematics journal - Probability and Statistics Theory, Nature Scientific Reports journal, Southern African Linguistics and Applied Language Studies, Connection Science journal
**State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting? (November, 2025)**
**LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification (February, 2025)**
**PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications (October, 2024)** - 🥇 best paper award at the Slovenian Conference on Artificial Intelligence (Information Society Multiconference IS2024)
**JSI and WüNLP at the DIALECT-COPA Shared Task: In-Context Learning From Just a Few Dialectal Examples Gets You Quite Far (June, 2024)**
**CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation (March, 2024)**
**Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages (March, 2024)**
**Automatic genre identification: a survey (November, 2023)**
**Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models (September, 2023)**
**ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification (March, 2023)**