Non-Syntactic Word Prediction for AAC
Karl Wiegand, Northeastern University, Boston, MA, USA
Rupal Patel, Ph.D., Northeastern University, Boston, MA, USA
This work is supported by the National Science Foundation under Grant No. 0914808.

Target Systems: Example 1
- DynaVox's Picture WordPower, part of the InterAACt framework

Target Systems: Example 2
- SpeakForYourself, an iPad application

Background
- Based on written language
- Users currently select letters, words, or icons in syntactically correct order*
- Non-syntactic input usually results in non-syntactic output
- For ease of use or speed of communication, this is not always true; much like IM-speak, it affects the quality and tone of communication.

* H. Van Balkom and M. Welle Donker-Gimbrere. 1996. A psycholinguistic approach to graphic language use. Augmentative and Alternative Communication: European Perspectives, pages 153–170.

Observations
- Speed of communication is important
- Complete/correct utterances are important
- Language style may be important
- There are existing strategies for:
  - Completing words
  - Completing syntactic utterances
  - Detecting missing letters of words

Prior and Related Work
- Missing function words (Compansion): McCoy et al., 1998
- Missing content words (memory-based language models using trigrams): Van den Bosch et al., 2006 and 2009
- Word relationships and disambiguation based on grammatical characteristics (IR): Tzoukermann et al., 1997; Allan and Raghavan, 2002

Prior and Related Work (continued)
- Word relationships based on semantic characteristics and roles (IR): Westerman and Cribbin, 2000; Fang and Zhai, 2006; Hemayati et al., 2007
- Word relationships based on distance and collocation (IR): Lin and Hovy, 2003; Lv and Zhai, 2009; Matiasek and Baroni, 2003 (moving window); Jarvelin et al., 2007 (s-grams)

Motivation
- Current completion and prediction strategies rely on syntactic input and word distance
- N-gram statistics are widely available for well-ordered input
- If the input isn't syntactically correct or well-ordered, can complete utterances still be predicted?

Exemplar
- "I like to play chess with my brother."  →  Input: like, play, chess, i, brother
- "My brother and I play video games."  →  Input: play, video games, i, brother
- "I play chess with my dad."  →  Input: play, chess, i, dad
- New input: i, brother, ...  →  How can we track these relationships?

Possible Approach
- Sentences are one of the smallest units of language that are:
  - Semantically coherent
  - Semantically cohesive
  - Syntactically demarcated
- ... and so could be leveraged for prediction

Semantic Grams
- A multiset of words that appear together in the same sentence
- Sentence: "I like to play chess with my brother."
- Sem-grams: brother, chess (1); brother, i (1); brother, like (1); brother, play (1); chess, i (1); chess, like (1); chess, play (1); i, like (1); i, play (1); like, play (1)

More on Sem-grams
- Sentence boundary detection is fast and relatively accurate (> 98.5%)
- Sentence-level co-occurrence, with uniform weight applied to all relationships in a sentence
- Order-independent and no null elements

Technical Problem Definition
- Given:
  - Multiset of existing words E
  - Set of candidate words C
- Output:
  - Most likely (argmax) candidate word c ∈ C, or
  - Ranked list of candidate words c ∈ C

Four Prediction Algorithms
- S1: Assumes conditional independence of existing words from each other (naive Bayes), over sem-grams (see the sketch after this list)
- S2: Random drawing of sem-grams from an existing pool
- N1: Copy of S1, but with unordered n-grams; reward adjacency next to all existing words
- N2: Reward adjacency next to at least one existing word (strength of n-grams?)
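To make the sem-gram representation and the S1 ranking rule above concrete, here is a minimal Python sketch. It assumes sentences have already been split, stop-filtered, and stemmed, and it applies plus-one smoothing (mentioned later under Corpus); the helper names and the exact smoothing denominator are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations
import math

def semgrams(stems):
    """Sem-grams for one sentence: unordered co-occurring word pairs,
    each relationship given uniform weight within the sentence."""
    return Counter(tuple(sorted(p)) for p in combinations(set(stems), 2))

def train(sentences):
    """Accumulate sem-gram counts and per-sentence word counts over a stemmed corpus."""
    pair_counts, word_counts = Counter(), Counter()
    for stems in sentences:
        pair_counts.update(semgrams(stems))
        word_counts.update(set(stems))
    return pair_counts, word_counts

def s1_rank(existing, candidates, pair_counts, word_counts):
    """S1-style ranking: treat the existing words as conditionally independent
    of each other given the candidate (naive Bayes), with plus-one smoothing."""
    vocab = len(word_counts) or 1
    def score(c):
        return sum(
            math.log((pair_counts[tuple(sorted((c, e)))] + 1)
                     / (word_counts[c] + vocab))
            for e in existing)
    return sorted(candidates, key=score, reverse=True)

# Example with the Exemplar sentences, already reduced to stems:
pairs, words = train([
    ["i", "like", "play", "chess", "brother"],
    ["i", "play", "video", "game", "brother"],
    ["i", "play", "chess", "dad"],
])
print(s1_rank(["i", "brother"], ["chess", "dad", "game"], pairs, words))
```

A full system would build the candidate set C by seeding from the top sem-grams of each input word, as described under Method below; here the candidates are supplied directly for brevity.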
Corpus
- Blog Authorship Corpus: ~140 million words from 19,320 bloggers, collected in August 2004
- Age range of 13–48; equally divided between males and females
- Pre-processing:
  - Split sentences and words
  - Remove stop words
  - Stem words
  - Check stems for dictionary membership
- Split by authors: 80% training, 20% testing
- Plus-one smoothing on trained sem-grams and n-grams (bigrams)

Method
- For every test sentence:
  - Process (split, stop, stem, and check)
  - Shuffle the stems
  - Remove one stem (the target)
  - Ask each algorithm to predict the missing stem by providing a ranked list of guesses
- Evaluation (random 2,000 sentences): score = position of the target word in the ranked list; lower scores are better

Method (continued)
- Test sentences truncated to 20 words
- N-gram algorithms seeded with the top 10 unordered n-grams for each input word
- Sem-gram algorithms seeded with the top 10 sem-grams for each input word
- Maximum of 190 candidate words to rank
- Ranked lists truncated to 100; otherwise, considered a "failure to predict"

Results: Sample 1
- Original sentence: "but i went to church yesterday with the fam."
- Target stem: went
- Input stems: yesterday, church
- N1 candidate list: went, morn, today, go, attend, work, afternoon, church, got, day, ...
- S1 candidate list: went, go, church, today, got, day, like, time, just, well, one, get, peopl, ...

Results: Sample 2
- Original sentence: "This semester Im taking six classes."
- Target stem: class
- Input stems: take, semest, six
- N1 candidate list: next, month, class, hour, last, second, week, year, first, five, flag, ...
- S1 candidate list: class, month, year, last, time, one, go, day, get, school, will, first, ...

Results: Sample 3
- Original sentence: "Hey, they're in first, by a game and a half over the Yankees."
- Target stem: game
- Input stems: yanke, hey, first, half
- N1 candidate list: game, stadium, like, hour, time, year, day, guy, hey, fan, say, one, two, ...
- S1 candidate list: game, got, like, red, time, play, team, sox, hour, go, fan, one, get, day, ...

Summary of Results

                    N1       N2       S1       S2
  # of Sentences    2000     2000     2000     2000
  # Predicted       647      649      435      435
  Average Score     16.26    19.70    9.04     12.67

Results by Sentence Length

Issues and Future Directions
- Accuracy vs. coverage
- Use of bigrams
- Seeding the candidate list
- Computational requirements
- Hybrid approaches:
  - Identical seed lists
  - Smooth from n-gram prediction to sem-gram prediction based on sentence length
  - Merge prediction lists
- BAC provides age, gender, and occupation

Application: Single-Page App
- Some options:
  1. Highlight predicted buttons for fast visual search
  2. Allow unordered entry based on shortest path
  3. Auto-fill previously typed customizations used with similar words

Application: Multi-Page App

Summary
- Unordered/non-syntactic prediction is possible
- N-grams can provide broad coverage
- Sem-grams may provide better accuracy
- Sem-grams may inform system behavior

Thank you!
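As a supplement to the Method slides above, the following is a minimal Python sketch of the leave-one-out evaluation: the stems of a preprocessed test sentence are shuffled, one target stem is held out, and the score is the target's position in the algorithm's ranked list (lower is better), with ranks beyond the truncation limit counted as failures to predict. The `rank_fn` callable and the exact failure handling are illustrative assumptions, not the authors' code.

```python
import random

def evaluate(test_sentences, rank_fn, max_rank=100):
    """Shuffle each preprocessed test sentence's stems, hold out one target,
    and score the algorithm by the target's position in its ranked guesses."""
    scores, failures = [], 0
    for stems in test_sentences:
        stems = list(stems)
        if len(stems) < 2:                     # need at least one input stem
            continue
        random.shuffle(stems)                  # order-independent input
        target, existing = stems[0], stems[1:] # remove one stem (the target)
        ranked = rank_fn(existing)[:max_rank]  # truncated ranked list
        if target in ranked:
            scores.append(ranked.index(target) + 1)  # lower is better
        else:
            failures += 1                      # "failure to predict"
    average = sum(scores) / len(scores) if scores else float("inf")
    return average, len(scores), failures
```

Averaging only over sentences where the target was found mirrors the separate "# Predicted" and "Average Score" rows in the Summary of Results table.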