Speakers have strong intuitions about whether a novel word is "possible" or not: for instance, English speakers judge "blick" to be possible (but unattested) whereas "bnick" is simply impossible. At the same time, both are well-formed in, e.g., Moroccan Arabic. In a number of recent studies, tools from NLP such as n-gram and maximum entropy models are trained on dictionary data and used to model speakers' well-formedness judgements, and in some cases, these model scores and human judgements are quite closely correlated. I will show, however, that these models are no better than simple baselines. Furthermore, I will argue that even if these models performed impressively, they are implausible descriptions of how humans actually make these judgements.
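To ground the discussion, here is a minimal sketch of the kind of n-gram model at issue, using letters as a stand-in for phonemes; the toy lexicon and add-alpha smoothing are illustrative assumptions, not the data or models from the studies being critiqued.

```python
import math
from collections import defaultdict

# Toy training lexicon (illustrative only; the studies under discussion
# train on full dictionary data).
lexicon = ["black", "block", "blink", "brick", "click", "slick", "trick"]

# Count letter bigrams with '#' as a word-boundary marker.
counts = defaultdict(lambda: defaultdict(int))
for word in lexicon:
    padded = "#" + word + "#"
    for a, b in zip(padded, padded[1:]):
        counts[a][b] += 1

def log_score(word, alpha=0.1, vocab_size=27):
    """Add-alpha-smoothed bigram log-probability of a word."""
    padded = "#" + word + "#"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        n_a = sum(counts[a].values())
        total += math.log((counts[a][b] + alpha) / (n_a + alpha * vocab_size))
    return total

print(log_score("blick"))  # higher: 'bl' is well attested word-initially
print(log_score("bnick"))  # lower: 'bn' and 'ni' are unattested in this lexicon
```

A simple baseline of the sort the talk alludes to might, for example, score a word by whether all of its bigrams are attested at all, with no probabilities involved.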
Identifying a Topic Mention in Private Conversations: A Semi-Supervised Approach
The ability to reliably spot conversations on a specified topic has many potential applications. Our work is motivated by the need to quantify the distribution of discussion on health-related topics in our subjects' everyday conversations. This task is unlike keyword spotting or document retrieval, where the input is defined by a few words of interest; instead, we are interested in identifying whole conversations on a topic of interest. Furthermore, private conversations preclude the possibility of creating a manually annotated corpus for learning a supervised classifier or a topic model. We describe a semi-supervised approach based on latent Dirichlet allocation (LDA). We initialize the LDA algorithm with a few keywords on the topic of interest and iteratively refine the vocabulary associated with the topic. We compare the effectiveness of our approach with other alternatives on two corpora -- a publicly available elicited corpus of telephone conversations (the Fisher corpus) and our corpus of everyday telephone conversations from 46 native English speakers, aged 65 and older, collected over a span of 6 months to a year. We report performance on both corpora and find that our semi-supervised approach is very effective at this task.
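As a rough sketch of how such keyword seeding might be implemented (the paper's exact refinement procedure may differ), one can bias LDA's topic-word prior toward the seed set and then grow that set from the learned topic. The documents, seed words, prior values, and iteration count below are all illustrative assumptions:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy documents and seed keywords (illustrative; the paper's corpora are
# telephone conversations, and its refinement loop may differ in detail).
docs = [["doctor", "appointment", "medicine", "pain"],
        ["weather", "rain", "garden", "flowers"],
        ["pills", "doctor", "blood", "pressure"],
        ["football", "game", "score", "team"]]
seed_words = {"doctor", "medicine", "health"}

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
num_topics, num_terms = 2, len(dictionary)

for _ in range(3):  # iterative refinement of the seeded topic
    # Bias topic 0's word prior (eta) toward the current keyword set.
    eta = np.full((num_topics, num_terms), 0.01)
    for w in seed_words:
        if w in dictionary.token2id:
            eta[0, dictionary.token2id[w]] = 1.0
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                   eta=eta, passes=20, random_state=0)
    # Grow the keyword set with topic 0's current top words.
    seed_words |= {w for w, _ in lda.show_topic(0, topn=5)}

print(sorted(seed_words))
```

A conversation could then be flagged as on-topic when the seeded topic receives a sufficiently high proportion of its inferred topic mass.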
Minimally-Obtrusive Respiratory Cycle Tracking for Assessing Sleep-Disordered Breathing Severity
Sleep-disordered breathing (SDB) is a highly prevalent condition associated with many adverse health problems. As the current means of diagnosis (polysomnography) is obtrusive and ill-suited for mass screening of the population, we explore a non-contact, automatic approach that uses acoustics-based methods. We present a method for automatically classifying breathing sounds produced during sleep. We compare the performance of several acoustic feature representations as well as model topologies for detecting diagnostically-relevant sleep breathing events to predict overall SDB severity. To address environmental noise, we use a noise reduction algorithm based on adaptive spectral mean subtraction. Our subject-independent method tracks rests in the breathing cycle with 84–87% accuracy, and predicts SDB severity at a level similar to full-night clinical polysomnography.
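As a rough illustration of the noise-reduction step, here is a minimal sketch of adaptive spectral mean subtraction, assuming a simple recursive noise estimate; the window length, smoothing constant, and update rule are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_mean_subtract(audio, fs, alpha=0.98):
    """Adaptive spectral mean subtraction (a standard noise-reduction
    recipe; parameter values here are illustrative)."""
    f, t, Z = stft(audio, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    noise = mag[:, 0].copy()           # initial noise estimate: first frame
    clean = np.empty_like(mag)
    for i in range(mag.shape[1]):
        # Recursively update the running mean of the noise spectrum.
        # (A common refinement updates only in detected noise-only frames;
        # this sketch updates every frame.)
        noise = alpha * noise + (1 - alpha) * mag[:, i]
        # Subtract it, flooring at zero to avoid negative magnitudes.
        clean[:, i] = np.maximum(mag[:, i] - noise, 0.0)
    _, denoised = istft(clean * np.exp(1j * phase), fs=fs, nperseg=512)
    return denoised
```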
Distributional semantic models for the evaluation of disordered language
Atypical semantic and pragmatic expression is frequently reported in the language of children with autism. Although this atypicality often manifests itself in the use of unusual or unexpected words and phrases, the rate of use of such unexpected words is rarely directly measured or quantified. In this paper, we use distributional semantic models to automatically identify unexpected words in narrative retellings by children with autism. The classification of unexpected words is sufficiently accurate to distinguish the retellings of children with autism from those with typical development. These techniques demonstrate the potential of applying automated language analysis techniques to clinically elicited language data for diagnostic purposes.
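To illustrate the general technique (not the authors' specific models or features), one can flag a word as unexpected when its distributional vector is dissimilar from the centroid of its context vectors; the pre-trained vector set, window size, and threshold below are illustrative assumptions:

```python
import numpy as np
import gensim.downloader as api

# Any pre-trained distributional vectors would do; this choice and the
# threshold below are illustrative, not the paper's configuration.
vectors = api.load("glove-wiki-gigaword-50")

def unexpected_words(tokens, window=4, threshold=0.2):
    """Flag words whose vector is dissimilar to the mean of their context."""
    flagged = []
    for i, w in enumerate(tokens):
        if w not in vectors:
            continue
        context = [t for t in tokens[max(0, i - window):i + window + 1]
                   if t != w and t in vectors]
        if not context:
            continue
        centroid = np.mean([vectors[t] for t in context], axis=0)
        sim = np.dot(vectors[w], centroid) / (
            np.linalg.norm(vectors[w]) * np.linalg.norm(centroid))
        if sim < threshold:
            flagged.append(w)
    return flagged

# Which words get flagged depends on the vectors and threshold chosen.
print(unexpected_words("the dog chased the ball across the piano".split()))
```

The per-word flags can then be aggregated into a rate of unexpected word use per retelling, which is the kind of quantity a diagnostic classifier would consume.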
Cloudbreak: A MapReduce Algorithm for Detecting Genomic Structural Variation
The detection of genomic structural variations remains one of the most difficult challenges in analyzing high-throughput sequencing data. Recent approaches have demonstrated that considering multiple mappings of all reads, rather than only uniquely mapped discordant fragments, can improve the performance of read-pair based detection methods. However, the computational requirements for storing and processing data sets with multiple mappings can be formidable. Meanwhile, the growing size and number of sequencing data sets have led to intense interest in distributing computation to cloud or commodity servers. MapReduce is becoming a standard framework for distributing processing across such compute clusters. In our recent paper, we described a novel conceptual framework for structural variation detection algorithms in MapReduce based on computing local features along the genome. In this framework, we developed and evaluated a distributed deletion-finding algorithm based on fitting a Gaussian mixture model to the distribution of mapped insert sizes spanning each location in the genome. On simulated and real data sets, our approach achieves performance similar to or better than a variety of popular structural variation detection algorithms, including read-pair, split-read, and hybrid approaches, and performs well across a wide range of deletion sizes. In particular, our algorithm excels at discovering deletions of 40–100 bp in repetitive regions of the genome. In addition, our algorithm can accurately genotype heterozygous and homozygous deletions from diploid samples. I will discuss the current state of our algorithm and our future plans for the software.
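The statistical core of the deletion-finding step can be sketched as follows: fit a two-component Gaussian mixture to the insert sizes spanning a locus, and read an implied deletion size off the separation between the component means. The simulated library parameters below are assumptions for illustration, not Cloudbreak's actual feature computation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated insert sizes spanning one locus (illustrative values).
rng = np.random.default_rng(0)
normal_lib = rng.normal(300, 30, 60)      # concordant fragments
deletion   = rng.normal(800, 30, 40)      # fragments spanning a ~500 bp deletion
inserts = np.concatenate([normal_lib, deletion]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(inserts)
means = sorted(gmm.means_.ravel())
print("component means:", means)
print("implied deletion size: ~%d bp" % (means[1] - means[0]))
```

The mixture weights offer a handle on genotype: roughly balanced weights suggest a heterozygous deletion, while a dominant shifted component suggests a homozygous one.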
Imposing marginal distribution constraints on language models
In this talk, I'll discuss some new work on re-estimating the parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of the well-known Kneser-Ney (1995) smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed in Chelba et al. (2010) (and elsewhere) when pruning Kneser-Ney models, while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity reductions when used with various baseline smoothing methods. We discuss future directions, including the potential to integrate with pruning and to distribute the algorithm. The algorithm will be released in the next version of the open-source OpenGrm ngram software library.
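For concreteness, the marginal constraint in question can be written in its standard Kneser-Ney-style form; this is the general shape of the constraint, not the talk's estimation algorithm:

```latex
% Marginal constraint on a smoothed backoff model P(w | h): for every
% backoff history h' and word w, the model's expected marginal must
% match the empirical lower-order distribution \hat{P}:
\sum_{h \,:\, \pi(h) = h'} \hat{P}(h)\, P(w \mid h) \;=\; \hat{P}(w \mid h')
% Here \pi(h) truncates a history h to its backoff history h', and
% \hat{P} denotes empirical (relative-frequency) estimates.
```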
Efficient Latent-Variable Grammars: Learning and Inference
Syntactic analysis is important for many NLP tasks, but constituency parsing is computationally expensive, often prohibitively so. In this work, we examine the barriers to efficient context-free processing and present several approaches to improve throughput. We propose three methods of incorporating efficiency concerns into the process of training latent-variable grammars, and present preliminary results indicating the potential of these approaches. First, we propose text normalization prior to grammar induction, and demonstrate that even simple normalizations can reduce the grammar size considerably with minimal impact on accuracy (a sketch of this kind of normalization follows below). Second, we propose a modeling approach, predicting inference time from attributes of the grammar and incorporating those predictions into the optimization criteria during split-merge grammar training. We present preliminary trials demonstrating a speedup of 30% with minimal accuracy loss. Finally, we propose a discriminative criterion for selecting state-splits, allowing a controlled tradeoff between speed and accuracy in the learned grammar.
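As an illustration of the normalization idea referenced above, a few simple token-level rules of the following kind can collapse many distinct terminals before induction; these particular rules are assumptions for illustration, not the ones used in the work:

```python
import re

# Illustrative normalizations: collapsing numbers and unifying quote
# characters so the grammar sees fewer distinct terminal shapes.
RULES = [
    (re.compile(r"\d+(\.\d+)?"), "<NUM>"),   # 1998, 3.14 -> <NUM>
    (re.compile(r"[\u2018\u2019]"), "'"),    # curly -> straight single quotes
    (re.compile(r"[\u201c\u201d]"), '"'),    # curly -> straight double quotes
]

def normalize(token):
    for pattern, repl in RULES:
        token = pattern.sub(repl, token)
    return token

print([normalize(t) for t in ["Revenue", "rose", "3.5", "%", "in", "1998"]])
# -> ['Revenue', 'rose', '<NUM>', '%', 'in', '<NUM>']
```

Because every collapsed terminal shape removes lexical productions from the induced grammar, even rules this simple can shrink the grammar with little effect on the parse structures themselves.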