A computational linguist with an MA degree in translation, pursuing a PhD in Information and Communication Technologies at the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. Developing language resources and technologies at the Department for Knowledge Technologies at the Jožef Stefan Institute. Developing and evaluating retrieval-augmented generation (RAG) solutions for the PandaChat project at the PC7 company. Especially active in text categorization tasks, web corpora collection, machine translation, and large language model evaluation.
➡️ LinkedIn: https://www.linkedin.com/in/taja-kuzman/
➡️ Google Scholar: https://scholar.google.com/citations?user=2WhxNBcAAAAJ
➡️ Twitter: https://twitter.com/TajaKuzman
➡️ GitHub: https://github.com/TajaKuzman
➡️ Medium: https://medium.com/@taja.kuzman
➡️ At my work: https://kt.ijs.si/members/taja-kuzman/
🌟 Developing a news topic categorization model for the EMMA project
🌟 Involved in the MEZZANINE project, focused on spoken language resources and speech technologies for Slovenian
🌟 Co-leading the CLASSLA knowledge centre for South Slavic languages
🌟 Organising the CLASSLA-Express workshop on using CLARIN.SI corpora in language research
🌟 PhD on automatic genre identification
🌟 Social media for CLARIN.SI (Twitter account, LinkedIn account, and Discord)
🌟 Developing and evaluating retrieval-augmented generation (RAG) solutions for the **PandaChat** project
🌟 **CLARIN.SI** Management Committee
🌟 Slovene Society for Language Technologies (SDJT)
🌟 SIGWAC: ACL special interest group on web as corpus
🌟 Web corpora creation and curation: MaCoCu project
🌟 Machine Translation of massive parliamentary corpora: ParlaMint project
🔍 Conferences: COLING 2022, TSD 2022, LREC-COLING 2024, COLING 2025
🔍 Journals: Research Methods in Applied Linguistics, Southern African Linguistics and Applied Language Studies, Mathematics journal - Probability and Statistics Theory, Nature Scientific Reports journal
**PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications (October, 2024)**
**JSI and WüNLP at the DIALECT-COPA Shared Task: In-Context Learning From Just a Few Dialectal Examples Gets You Quite Far (June, 2024)**
**CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation (March, 2024)**
**Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages (March, 2024)**
**Automatic genre identification: a survey (November, 2023)**
**Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models (September, 2023)**
**ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification (March, 2023)**