A computational linguist with an MA degree in translation, pursuing a PhD in Information and Communication Technologies at the Jožef Stefan International Postgraduate School, Ljubljana, Slovenia. Developing language resources and technologies at the Department for Knowledge Technologies at the Jožef Stefan Institute. Developing and evaluating retrieval-augmented generation (RAG) solutions for the PandaChat project at the PC7 company. Especially active in text categorization tasks, web corpora collection, machine translation, and large language model evaluation.


Find me on …

➡️ LinkedIn: https://www.linkedin.com/in/taja-kuzman/

➡️ Google Scholar: https://scholar.google.com/citations?user=2WhxNBcAAAAJ

➡️ Twitter: https://twitter.com/TajaKuzman

➡️ GitHub: https://github.com/TajaKuzman

➡️ Medium: https://medium.com/@taja.kuzman

➡️ At my work: https://kt.ijs.si/members/taja-kuzman/

Currently working on …

🌟 Developing a news topic categorization model for the EMMA project

🌟 Involved in the MEZZANINE project, focused on spoken language resources and speech technologies for Slovenian

🌟 Co-leading the CLASSLA knowledge centre for South Slavic languages

🌟 Organising the CLASSLA-Express workshop on using CLARIN.SI corpora in language research

🌟 PhD on automatic genre identification

🌟 Social media for CLARIN.SI (Twitter account, LinkedIn account, and Discord)

🌟 Developing and evaluating retrieval-augmented generation (RAG) solutions for the **PandaChat** project

Member of …

🌟 **CLARIN.SI** Management Committee

🌟 Student Council of Jožef Stefan International Postgraduate School (IPS)

🌟 Slovene Society for Language Technologies (SDJT)

🌟 SIGWAC: ACL special interest group on web as corpus

Past projects …

🌟 Web corpora creation and curation: MaCoCu project

🌟 Machine translation of massive parliamentary corpora: ParlaMint project

Reviewing for …

🔍 Conferences: COLING 2022, TSD 2022, LREC-COLING 2024

🔍 Research Methods in Applied Linguistics journal

🔍 Southern African Linguistics and Applied Language Studies journal

🔍 Mathematics journal - Probability and Statistics Theory

My work in 4 tweets

https://twitter.com/ClarinSlovenia/status/1732305321710821886

https://twitter.com/TajaKuzman/status/1695057006082551858

https://twitter.com/TajaKuzman/status/1701588652063953224

https://twitter.com/TajaKuzman/status/1716723062341570832

Most relevant recent papers

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation (March 2024)

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages (March 2024)

Automatic genre identification: a survey (November 2023)

Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models (September 2023)

ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification (March 2023)

Most relevant recent datasets

Most relevant technologies

X-GENRE classifier: a multilingual text genre classifier