Research

Eight systems and lines of work. The thread connecting them is making computers understand human feeling and human language — especially Japanese, especially online, and increasingly for languages nobody else builds tools for.

ML-Ask · Affect Analysis System

Emotive elements, emotive expressions, and Russell's 2D affect space, for Japanese.

ML-Ask — eMotive eLement and Expression Analysis system — is a keyword-based, language-dependent system for automatic affect annotation of Japanese utterances. It is built on a simple linguistic assumption: a speaker's emotional state is conveyed by emotional expressions used in emotive utterances. ML-Ask first decides whether a sentence is emotive at all, then — within emotive sentences only — looks for expressions of specific emotion types.

Two ingredients carry the system. Emotemes are signal words that mark emotivity without specifying which emotion — interjections (すごい sugoi), mimetic expressions (わくわく wakuwaku), vulgar morphemes (〜やがる -yagaru), and emotive sentence markers ("!", "??"). Emotive expressions are the words that name the feeling itself — nouns (愛情 aijou, love), verbs (悲しむ kanashimu, to grieve), adjectives, and set phrases. The expression database is based on Akira Nakamura's Emotive Expression Dictionary, sorted into ten classical Japanese emotion types (joy, anger, sorrow, fear, shame, fondness, dislike, excitement, relief, surprise) — roughly 2,100 expressions in total.
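The two-stage decision described above can be sketched in a few lines. This is a minimal illustration rather than ML-Ask's actual code: the lexicons are tiny hand-picked samples, the emotion labels attached to each expression are assumptions made for the example, and `annotate` is a hypothetical function name.

```python
# Toy lexicons for illustration only. The real databases are far larger
# (roughly 2,100 emotive expressions); the labels here are assumptions.
EMOTEMES = ["すごい", "わくわく", "やがる", "!", "！", "??"]
EMOTIVE_EXPRESSIONS = {
    "愛情": "fondness",
    "悲しむ": "sorrow",
    "嬉しい": "joy",
    "怖い": "fear",
}


def annotate(utterance: str) -> dict:
    """Stage 1: is the utterance emotive? Stage 2: which emotion types?"""
    found_emotemes = [e for e in EMOTEMES if e in utterance]
    if not found_emotemes:
        # Non-emotive sentences are never searched for emotion types.
        return {"emotive": False, "emotemes": [], "emotions": []}
    emotions = sorted(
        {label for expr, label in EMOTIVE_EXPRESSIONS.items() if expr in utterance}
    )
    return {"emotive": True, "emotemes": found_emotemes, "emotions": emotions}


print(annotate("今日は嬉しい!"))
print(annotate("今日は嬉しい"))
```

The second call has no emoteme, so it is treated as non-emotive even though it contains an emotive expression, which mirrors the "within emotive sentences only" rule.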

ML-Ask also implements Contextual Valence Shifters (Polanyi & Zaenen, 2006) with 108 Japanese negation patterns, and projects the detected emotion onto Russell's two-dimensional model of affect (valence × activation), so downstream applications can reason about positive-activated vs. negative-deactivated mood rather than ten discrete labels.
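As a rough sketch of how valence shifting and the projection onto Russell's space might fit together: the coordinates below, the single negation pattern, and the rule that a shifter flips only the valence sign are all simplifying assumptions for illustration, not the system's actual tables or its 108 patterns.

```python
import re

# Hypothetical stems and (valence, activation) signs, for illustration only.
EXPRESSION_STEMS = {"嬉し": "joy", "悲し": "sorrow", "怖": "fear"}
RUSSELL = {"joy": (1, 1), "sorrow": (-1, -1), "fear": (-1, 1)}

# One toy negation pattern standing in for the real pattern set.
NEGATION = re.compile(r"く(?:ない|ありません)")


def project(utterance: str) -> list:
    """Return (emotion, valence, activation) for each expression found."""
    results = []
    for stem, emotion in EXPRESSION_STEMS.items():
        pos = utterance.find(stem)
        if pos == -1:
            continue
        valence, activation = RUSSELL[emotion]
        # Assumption: a negation directly after the stem flips valence only.
        if NEGATION.match(utterance, pos + len(stem)):
            valence = -valence
        results.append((emotion, valence, activation))
    return results


print(project("嬉しい"))
print(project("嬉しくない"))
```

Working in the two-dimensional space means a negated expression can still be placed somewhere meaningful, instead of being dropped or kept under its original discrete label.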

Preferred citations

  • Michal Ptaszynski, Pawel Dybala, Rafal Rzepka, Kenji Araki, "Affecting Corpora: Experiments with Automatic Affect Annotation System — A Case Study of the 2channel Forum". PACLING-09, Sapporo, 2009.
  • Michal Ptaszynski, Pawel Dybala, Wenhan Shi, Rafal Rzepka, Kenji Araki, "A System for Affect Analysis of Utterances in Japanese Supported with Web Mining". J. Japan Society for Fuzzy Theory and Intelligent Informatics, 21(2), 2009.

CAO · Emoticon Analysis System

10,000+ Japanese kaomoji, decomposed into eyes / mouth / framing — and reassembled into emotion.

CAO is a fully automatic analyser for Japanese kaomoji-style emoticons — the (^_^) / orz / (╯°□°)╯ family of pictographic glyphs that dominate Japanese online communication. Given an input string it identifies emoticons and assigns them to specific emotion types.

The pipeline runs in two stages. First, a database lookup against more than ten thousand pre-collected emoticons. For glyphs not in the database — there are always new ones — CAO performs structural decomposition into semantic regions: eyes, mouth, and framing characters. Each region carries its own emotion distribution, learned from co-occurrence patterns in the database, and the final label is the joint probability over those regions. The design is grounded in kinesics, Birdwhistell's theory of non-verbal communication.
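The lookup-then-decompose flow can be sketched as follows. Everything concrete here is assumed for illustration: the two-entry database, the probability tables, the fixed eye-mouth-eye shape, and the treatment of regions as independent (so the joint score is a simple product). Framing characters are only stripped, not scored, and the step that locates emoticons inside a longer string is omitted.

```python
# Toy data for illustration; the real database holds 10,000+ emoticons.
EMOTICON_DB = {"(^_^)": "joy", "(T_T)": "sorrow"}
EYE_DIST = {
    "^": {"joy": 0.9, "sorrow": 0.1},
    "T": {"joy": 0.1, "sorrow": 0.9},
}
MOUTH_DIST = {
    "_": {"joy": 0.5, "sorrow": 0.5},
    "▽": {"joy": 0.8, "sorrow": 0.2},
}
FRAMING = "()（）[]{}<>"
LABELS = ("joy", "sorrow")


def decompose(emoticon: str):
    """Strip framing characters and split the core into eye, mouth, eye."""
    core = emoticon.strip(FRAMING)
    if len(core) != 3:
        return None
    return core[0], core[1], core[2]


def classify(emoticon: str):
    # Stage 1: exact database lookup.
    if emoticon in EMOTICON_DB:
        return EMOTICON_DB[emoticon]
    # Stage 2: structural decomposition for unseen emoticons.
    parts = decompose(emoticon)
    if parts is None:
        return None
    eye_l, mouth, eye_r = parts
    regions = [EYE_DIST.get(eye_l), MOUTH_DIST.get(mouth), EYE_DIST.get(eye_r)]
    if any(region is None for region in regions):
        return None
    scores = {}
    for label in LABELS:
        score = 1.0
        for region in regions:
            score *= region[label]  # independence assumption
        scores[label] = score
    return max(scores, key=scores.get)


print(classify("(^_^)"), classify("(^▽^)"), classify("[T_T]"))
```

The second and third inputs are not in the toy database, so their labels come entirely from the per-region distributions.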

Downloads: All emoticons (sorted) · Triplets (eye-mouth-eye) · Mouths + frequencies · Eyes + frequencies · Standalone detector (Perl)
License: New BSD (3-Clause)
Award: 2011 IEEE Sapporo Section Encouragement Award.

Preferred citations

  • Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal Rzepka, Kenji Araki, "CAO: A Fully Automatic Emoticon Analysis System Based on Theory of Kinesics". IEEE Transactions on Affective Computing, 1(1), pp. 46–59, 2010.
  • Michal Ptaszynski, Jacek Maciejewski, Pawel Dybala, Rafal Rzepka, Kenji Araki, "CAO: A Fully Automatic Emoticon Analysis System". AAAI-10, Atlanta, 2010.

Automatic Cyberbullying Detection

From ML-Ask + SVM to a patented PMI-IR method — built on the only annotated real-world Japanese cyberbullying dataset.

This research began in September 2009 at the PACLING banquet, when Prof. Fumito Masui mentioned he'd been collecting cyberbullying entries from the unofficial websites of Japanese schools — the kind of pages that the Mie Prefecture Human Rights Center had been trying to monitor manually. The volume of suspect pages had outgrown what teachers and PTA volunteers could read by hand, so the question was natural: can we automate the triage?

The first published method (AISB 2010) used ML-Ask to find that vulgar and violent vocabulary were the strongest discriminators, then fed a dedicated lexicon into an SVM classifier. SVMs eventually hit a ceiling — Japanese cyberbullying is wordplay-heavy and words alone miss context — so we moved to SO-PMI-IR. The trick was to apply Turney's method not to individual words but to phrases: this cleared most ambiguity, and grouping the seed words further (Nitta et al., IJCNLP 2013) lifted accuracy enough that the method was eventually filed as a patent (JP 2015-103210).
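To make the phrase-level idea concrete, here is a PMI-IR-style association score between a phrase and grouped seed words. It is a sketch under stated assumptions, not the published or patented method: the hit counts are invented placeholders standing in for search-engine queries, the 0.01 smoothing follows Turney's general practice, and taking the maximum over seed groups is only one plausible reading of "category relevance maximization".

```python
import math

# All counts below are made-up placeholders, not measured data.
N_DOCS = 1_000_000
SMOOTHING = 0.01
HITS = {
    "phrase_harmful": 1000,
    "phrase_neutral": 1000,
    "seed_abuse_1": 2000,
    "seed_abuse_2": 2000,
    "seed_violence_1": 2000,
}
CO_HITS = {
    ("phrase_harmful", "seed_abuse_1"): 200,
    ("phrase_harmful", "seed_abuse_2"): 150,
    ("phrase_harmful", "seed_violence_1"): 5,
    ("phrase_neutral", "seed_abuse_1"): 2,
    ("phrase_neutral", "seed_abuse_2"): 2,
    ("phrase_neutral", "seed_violence_1"): 2,
}
SEED_GROUPS = {
    "abuse": ["seed_abuse_1", "seed_abuse_2"],
    "violence": ["seed_violence_1"],
}


def pmi(phrase: str, seed: str) -> float:
    """PMI = log2( hits(phrase, seed) * N / (hits(phrase) * hits(seed)) )."""
    joint = CO_HITS.get((phrase, seed), 0) + SMOOTHING
    marginals = (HITS[phrase] + SMOOTHING) * (HITS[seed] + SMOOTHING)
    return math.log2(joint * N_DOCS / marginals)


def harm_score(phrase: str) -> float:
    """Score a whole phrase by its strongest association with any seed group."""
    return max(
        sum(pmi(phrase, seed) for seed in seeds)
        for seeds in SEED_GROUPS.values()
    )


for phrase in ("phrase_harmful", "phrase_neutral"):
    print(phrase, round(harm_score(phrase), 2))
```

Scoring the phrase as a unit, rather than summing scores of its individual words, is what lets wordplay and context-dependent expressions register as a single signal.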

Current work goes in three directions: (1) finding release-grade preprocessing for the corpus, i.e. masking enough personal information that the data can be shared with other labs without identifying victims; (2) tightening the PMI method further with student projects on lexicon expansion and parameter optimization; (3) Language-Combinatorics-based pattern mining to find recurrent cyberbullying constructions automatically.

Patent: JP 2015-103210 (SO-PMI-IR for harm detection)
Outreach: Facebook project page
Stakeholders: Mie Prefecture Human Rights Center · Parental Options (USA)

Key references

  • Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka, Kenji Araki, "Machine Learning and Affect Analysis Against Cyber-Bullying". AISB'10, Leicester, 2010.
  • Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka, Kenji Araki, Yoshio Momouchi, "In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis". Int'l J. Computational Linguistics Research, 1(3), 135–154, 2010.
  • Taisei Nitta, Fumito Masui, Michal Ptaszynski, Yasutomo Kimura, Rafal Rzepka, Kenji Araki, "Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization". IJCNLP 2013, Nagoya, pp. 579–586.
  • Michal Ptaszynski, Fumito Masui, Yasutomo Kimura, Rafal Rzepka, Kenji Araki, "Brute Force Works Best Against Bullying". IJCAI-15 IP Workshop, Buenos Aires, 2015.

YACIS · Yet Another Corpus of Internet Sentences

5.6 billion words of Japanese blog text — the largest single-genre Japanese affect-annotated corpus we know of.

YACIS is a large-scale corpus of Japanese blog sentences scraped, deduplicated and lemmatized for use in NLP and affective-computing research. The headline number is the ~5.6 billion word tokens spanning Ameba blog domains, but the technically interesting part is the annotation layer: the entire corpus was passed through ML-Ask 4.2 (the "fast and furious" branch) and CAO, producing per-sentence affect labels and per-emoticon emotion tags. This makes YACIS unusually useful for downstream tasks that need real-world emotion distributions in informal Japanese — distributions you can't get from formal corpora like KWDLC.

ML-Ask 4.2's regex precompilation and emoticon-detection rewrite (≈10× faster than ML-Ask 4.0) were originally driven by the need to annotate all of YACIS in a practical amount of time.

SPEC · Sentence Pattern Extraction Architecture

Language-independent extraction of n-element ordered combinations — n-grams' more flexible cousin.

SPEC formalizes a "sentence pattern" as an n-element ordered combination of sentence elements (tokens, characters, POS tags, or any user-defined unit). Unlike n-grams, the combination doesn't have to be contiguous: pattern A … B … C matches any sentence containing A, B, C in that order with arbitrary material between. This makes SPEC more expressive than n-grams for tasks like detecting cyberbullying constructions (「お前」… 「死ね」) or sarcasm, where the cue words are far apart.
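The core idea of ordered, possibly non-contiguous combinations can be illustrated with the standard library alone. This is a minimal sketch, not SPEC's implementation: `extract_patterns` and `matches` are hypothetical names, the input is assumed to be already tokenized, and the sketch does not distinguish contiguous elements from elements separated by a gap.

```python
from itertools import combinations


def extract_patterns(tokens, max_len=3):
    """All ordered combinations of 1..max_len elements, gaps allowed."""
    patterns = set()
    for n in range(1, max_len + 1):
        # combinations() preserves the original left-to-right order.
        patterns.update(combinations(tokens, n))
    return patterns


def matches(pattern, tokens):
    """True if the pattern elements occur in `tokens` in the same order."""
    remaining = iter(tokens)
    # `elem in remaining` consumes the iterator up to the first hit,
    # so each later element must be found after the previous one.
    return all(elem in remaining for elem in pattern)


sentence = ["お前", "なんか", "早く", "死ね"]
patterns = extract_patterns(sentence)
print(("お前", "死ね") in patterns)
print(matches(("お前", "死ね"), sentence))
print(matches(("死ね", "お前"), sentence))
```

A bigram model would never pair the first and last tokens of this sentence, whereas the combination-based pattern captures them regardless of what sits in between. The number of combinations grows quickly with sentence length, which is why `max_len` is capped in this sketch.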

The architecture is language-independent: tokenization is pluggable, and the same engine has been applied to Japanese (with MeCab), Polish, and Ainu.

POST-AL · POS Tagger for the Ainu Language

NLP tooling for a critically endangered indigenous language of northern Japan.

Ainu — the indigenous language of Hokkaido and Sakhalin — is classified by UNESCO as critically endangered. Fewer than a hundred fluent speakers remain, and almost no digital tooling exists for the language: no tokenizer, no POS tagger, no usable lexicon for downstream NLP. POST-AL is our small contribution toward closing that gap.

The tagger is paired with parallel work on (1) corpus collection from the Ainu Oral Literature Archive at the National Ainu Museum, (2) Romanization-to-katakana transliteration, and (3) machine-translation experiments with very small training sets. The broader project sits under the umbrella of language revitalisation technology — building enough of an NLP toolkit that future learners' apps, dictionaries and search interfaces have something to build on.

Contextual Appropriateness of Emotions

Not just what emotion was expressed — was it appropriate for the context?

Sentiment analysis usually stops at the label: "this utterance is angry." But anger in the right context is healthy; anger in the wrong context is harassment, irony, or trolling. This project adds a second judgement on top of standard affect analysis: was the expressed emotion appropriate given the situation it was uttered in?

The method combines an affect-analysis backend (ML-Ask) with a web-mining step that crawls the open web for sentences describing what people normally feel in situations of the same kind, then compares the elicited "expected" emotion distribution against the actual one. Implemented inside a conversational agent, the appropriateness signal lets the agent distinguish sincere expressions from sarcastic / inappropriate ones — and choose its response accordingly.
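The comparison step can be sketched as a check of the expressed emotion against a distribution built from mined sentences. The labels, the mined sample, and the 0.2 threshold are all assumptions for illustration; the web-mining and affect-analysis steps that would produce these labels are not shown.

```python
from collections import Counter


def expected_distribution(mined_labels):
    """Normalize emotion labels found in web-mined sentences about a situation."""
    counts = Counter(mined_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_appropriate(expressed, expected, threshold=0.2):
    """Assumption: an emotion is appropriate if it is common enough in context."""
    return expected.get(expressed, 0.0) >= threshold


# Hypothetical labels mined for one kind of situation.
mined = ["joy"] * 7 + ["excitement"] * 2 + ["fear"]
expected = expected_distribution(mined)

print(expected)
print(is_appropriate("joy", expected))
print(is_appropriate("anger", expected))
```

An agent could use the boolean as a switch between response strategies, for example treating an emotion that is rare for the situation as a possible cue for sarcasm.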

Automatic Evaluation of Conversational Agents

Using affect analysis as a proxy for user satisfaction with Japanese chatbots.

Building Japanese conversational agents is hard; evaluating them is harder. Post-conversation questionnaires are slow, expensive, and biased by recency. We propose using affect analysis during the conversation itself as a continuous, zero-cost proxy: how engaged is the user, emotionally, while they're talking to the agent?

Operationally, ML-Ask runs over the user's utterances in real time and outputs per-turn affective signals. Aggregated across a dialog, these correlate well (in our experiments) with the satisfaction scores users would have given on a post-hoc questionnaire — which suggests the affect-analysis trace is a reasonable continuous-time replacement for discrete questionnaire data.
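A sketch of the aggregation and correlation step, assuming each user turn has already been reduced to a valence value of -1, 0, or +1. The numbers below are invented to show the mechanics and are not experimental results; averaging per dialog is also just one possible aggregation.

```python
from math import sqrt
from statistics import mean


def dialog_score(turn_valences):
    """Aggregate per-turn valence signals into one score per dialog."""
    return mean(turn_valences)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented toy data: per-turn valence for four dialogs, plus the
# questionnaire score each user might have given afterwards.
dialogs = [[1, 1, 0], [0, -1, -1], [1, 0, 0], [-1, -1, -1]]
questionnaire = [5, 2, 4, 1]

affect_scores = [dialog_score(d) for d in dialogs]
print(round(pearson(affect_scores, questionnaire), 3))
```

Because the affect trace exists for every turn, the same aggregation can be applied to a sliding window instead of the whole dialog when a running estimate is needed.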