
#46 Multimodal Detection of Driver Distraction


Principal Investigator
Maxine Eskenazi
Status
Completed
Start Date
Jan. 1, 2017
End Date
Aug. 31, 2018
Project Type
Research Advanced
Grant Program
MAP-21 TSET National (2013 - 2018)
Grant Cycle
2017 TSET UTC
Visibility
Public

Abstract

Vehicles are equipped with an increasing number of safety devices that have dramatically driven down the number of accidents. It is therefore disturbing that a recent rise in accidents has been noted. Driver distraction is causing more accidents despite legislation in many states against the chief source of distraction, the cellphone. Since legislation has not stopped drivers from using their cellphones to talk or to text, we propose that drivers who continue to use their phones should be warned when that use is creating a dangerous situation. This project concerns the development of automatic detection of driver distraction from speech and video. Eventually installed in a vehicle or in an app, the detector would either emit an alarm or shut an app down when distraction is detected so that the driver can attend to the road. The project uses both speech (the driver’s hesitations, choice of words, and other changes in normal speaking habits) and vision (the driver turning their head, looking down, or looking to the left or right) as input information. The detectors will be trained on data gathered using a driving simulator and tasks that promote subject distraction. Not only will this project deliver automatic distraction detection algorithms, but it will also make its two driving databases publicly available so that others can use them to research distraction.
Description
Distraction is a complex human affect. Its causes interrupt the chain of attention to tasks like guiding a vehicle, and it leaves markers of its existence, such as pauses in speech. Different individuals experience distraction to varying degrees and express it in many different ways. We began our exploration with the most common markers that can be found in speech. In order to have a more robust representation of distraction, we now add the markers that can be found in visual cues, such as head turning. We have noted that some drivers may show distraction only by turning their head toward a cellphone, food, or a passenger. Others may not turn their heads but may be concentrating on something they are listening to and show this through hesitations in their speech. For such a complex set of behaviors, we must use a broad set of detectors, a set that is multimodal.
	Better understanding the human behavior related to distraction is the central question of this proposal. Building upon our expertise in the automatic analysis of human nonverbal behavior, we propose to study both the visual and the acoustic behavior related to distraction and to build new algorithms that integrate these two modalities for robust prediction. Our previous research has shown promising results in analyzing the acoustic behavior of drivers by modeling the prosodic cues in their speech. We are currently building a new dataset that will record not only the driver’s speech but also their visual behavior. This dataset will be enhanced with contextual cues about the state of the car and the activity log of the driver (e.g., cellphone activities).
	We propose a two-year research plan to study and model the multimodal behavior (i.e., both visual and acoustic) of the driver from this new dataset. Our first year will focus on visual behavior, starting with behavior related to attention. We plan to analyze the head and eye gaze behavioral patterns of the driver. We already have the technology for automatic sensing of these visual cues: Dr. Morency’s team has built the OpenFace software, which automatically analyzes images from a video feed to extract the head position and orientation as well as the eye gaze direction [WACV 2016]. This software was shown to outperform previous approaches on many publicly available datasets. Attention will be inferred by integrating these two sources of information. We also plan to study the facial expressions that occur during the driving sessions. We will first focus on negative affect such as frowning, where the facial expression is concentrated around the eyebrows. A second category of facial expressions will be related to surprise, where, for example, the driver raises their eyebrows. These two categories of facial expressions will be the starting points for our analysis. We plan to study the dynamics of these expressions, such as how quickly a person raises their eyebrows or frowns. We will continue our study by jointly analyzing the driver’s gaze and facial expressions, which will allow us to better understand the relationship between attention and facial displays.
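As an illustration of how these visual cues could be turned into a simple attention flag, the sketch below reads the per-frame CSV that OpenFace’s FeatureExtraction tool produces and marks frames where the head yaw or horizontal gaze angle exceeds a threshold. The column names follow the OpenFace 2.x output format, and the thresholds are placeholder assumptions rather than values fixed by this project.

import numpy as np
import pandas as pd

def flag_off_road(csv_path, yaw_thresh_deg=30.0, gaze_thresh_deg=25.0):
    # Hypothetical thresholds; they would be tuned against the simulator ground truth.
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()        # OpenFace headers carry leading spaces
    yaw_deg = np.degrees(df["pose_Ry"])        # head rotation about the vertical axis
    gaze_deg = np.degrees(df["gaze_angle_x"])  # horizontal eye-gaze angle
    off_road = (yaw_deg.abs() > yaw_thresh_deg) | (gaze_deg.abs() > gaze_thresh_deg)
    return df.loc[off_road & (df["success"] == 1), ["frame", "timestamp"]]

# Example: list the moments where the driver looked away from the road.
# flag_off_road("driver_session01.csv")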
At the end of the first year, we also plan to integrate the acoustic behaviors previously identified for distraction with the new visual behaviors identified during the first year. We plan to take advantage of Dr. Morency’s expertise in multimodal machine learning to fuse the information from the acoustic and visual modalities. For example, the Multi-View Conditional Random Field will allow us to model not only the interaction between these two modalities (referred to as two “views” in the model) but also the temporal dynamics of these behaviors [CVPR 2012]. We also plan to explore more recent work on recurrent neural networks, such as long short-term memory models, which have shown promising results for modeling long-range dependencies in sequential data.
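A minimal sketch of what such a fusion model could look like is given below, using two LSTMs with late fusion rather than the Multi-View CRF cited above; the feature dimensions, window length, and layer sizes are illustrative assumptions only.

import torch
import torch.nn as nn

class TwoStreamDistractionModel(nn.Module):
    def __init__(self, acoustic_dim=40, visual_dim=14, hidden=64):
        super().__init__()
        self.acoustic_lstm = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)      # distracted vs. attentive

    def forward(self, acoustic_seq, visual_seq):
        # acoustic_seq: (batch, frames, acoustic_dim), e.g. prosodic features
        # visual_seq:   (batch, frames, visual_dim),   e.g. head pose, gaze, expressions
        _, (h_a, _) = self.acoustic_lstm(acoustic_seq)
        _, (h_v, _) = self.visual_lstm(visual_seq)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)   # late fusion of final hidden states
        return torch.sigmoid(self.classifier(fused))    # probability of distraction

# Example forward pass on two random 5-second windows sampled at 30 frames per second.
model = TwoStreamDistractionModel()
probs = model(torch.randn(2, 150, 40), torch.randn(2, 150, 14))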
	In the second year, we will focus on putting our analysis of the visual and acoustic behaviors into context. More specifically, we plan to analyze two contextual sources: (1) the linguistic content of the speech and (2) the related activities occurring in the car, such as cellphone activities (texting, using an app that displays items on the screen, using an app that interacts by speech only). When the driver is distracted, their speech may become less cohesive and their sentences less well constructed. We can use measures of cohesiveness that Dr. Eskenazi is presently developing to determine when the cohesive nature of a subject’s speech changes. When a speaker can devote most of their attention to their speech, it is well planned and constructed: sentences have structure, such as a subject, verb, and object, and are joined to one another through co-reference. When a speaker can no longer devote as much attention and planning to their speech, sentences become fragmented and the meaning is sometimes lost from one sentence to the next. To detect this, we will use the Stanford Parser to analyze the output of automatic speech recognition. That output may come from the Google, Microsoft, or PocketSphinx recognizer, or some combination of the three, in order to provide the best estimate of what was said and to return a result rapidly.
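The cohesiveness measures themselves are still being developed; as a rough stand-in, the sketch below scores an ASR transcript for hesitation rate and for the share of utterances lacking a main verb, using NLTK’s part-of-speech tagger for brevity instead of the Stanford Parser. The filler list and both scores are illustrative assumptions.

import nltk  # requires the punkt and averaged_perceptron_tagger resources

FILLERS = {"um", "uh", "er", "hmm"}

def cohesion_proxy(utterances):
    # utterances: list of recognized strings, one per speech segment
    tokens = [nltk.word_tokenize(u.lower()) for u in utterances]
    n_words = max(1, sum(len(t) for t in tokens))
    filler_rate = sum(w in FILLERS for t in tokens for w in t) / n_words
    tagged = [nltk.pos_tag(t) for t in tokens]
    fragments = sum(not any(tag.startswith("VB") for _, tag in t) for t in tagged)
    return {"filler_rate": filler_rate,
            "fragment_ratio": fragments / max(1, len(utterances))}

# Example: a fluent utterance followed by a fragmented one.
# cohesion_proxy(["I am taking the next exit", "uh the thing on the um"])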
	In the later part of the second year, we plan to contextualize the analysis of visual, acoustic, and linguistic behaviors with the other activities happening in the car, such as use of the gas pedal and the steering wheel. Since we are logging and time-syncing cellphone events during the dataset recordings, we plan to study the correlation of these contextual events with the driver’s multimodal behavior. For example, if the nature of the information displayed on the cellphone screen changes, the gaze patterns of the driver may also change. We plan to employ temporal pattern analysis (e.g., T-Patterns) to study the contingencies between the context and the driver’s behavior. We will then integrate our models into a meta-detector and assess its precision.
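As a first-order illustration of such a contingency analysis (far simpler than a full T-Pattern search), the sketch below counts how often an off-road glance begins within a fixed window after a logged cellphone event; the two-second window is an assumption, and the time stamps are assumed to share the recordings’ synchronized clock.

from bisect import bisect_left

def contingency_rate(phone_events, glance_onsets, window=2.0):
    # phone_events, glance_onsets: sorted lists of time stamps in seconds
    followed = 0
    for t in phone_events:
        i = bisect_left(glance_onsets, t)
        if i < len(glance_onsets) and glance_onsets[i] - t <= window:
            followed += 1
    return followed / max(1, len(phone_events))

# Example: three of the four phone events are followed by a glance within two seconds.
# contingency_rate([10.0, 25.0, 40.0, 55.0], [11.2, 26.5, 41.0])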
	Concurrently, the database we collected in 2016 will be structured and publicly released.
Timeline
January 1, 2017 – December 31, 2018.
Strategic Description / RD&T

    
Deployment Plan
During the course of our work, we will continue to demonstrate our findings at conferences and to key visitors. We will search for partners who want to use the databases that we have developed. We will also search for partners, such as car makers, who would like to embed our algorithms in vehicles.
Expected Outcomes/Impacts
In 2017, we will develop algorithms to detect changes in subject head movement, gaze and facial expressions. We will compare the output of these detectors to the simulator data where each subject indicated at what point they were distracted. We will assess the quality of the detectors. We will also format and publicly release the 2016 simulator driving database.
In 2018, we will integrate the vision detectors with the speech detectors and add linguistic analysis. We will compare the output of this meta-detector to the simulator data where each subject indicated at what point they were distracted.
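The comparison against the subjects’ self-reported distraction points could be scored along the lines of the sketch below, which matches detected distraction times to reported times within a tolerance window and reports precision and recall; the three-second tolerance is an illustrative assumption.

def score_detector(detected, reported, tolerance=3.0):
    # detected, reported: lists of time stamps (seconds) from one driving session
    true_pos = sum(any(abs(d - r) <= tolerance for r in reported) for d in detected)
    hit = {r for r in reported if any(abs(d - r) <= tolerance for d in detected)}
    precision = true_pos / max(1, len(detected))
    recall = len(hit) / max(1, len(reported))
    return precision, recall

# Example: two of three detections line up with the two reported distraction points.
# score_detector([12.0, 47.5, 80.0], [13.0, 46.0])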
Expected Outputs

    
TRID


    

Individuals Involved

Email Name Affiliation Role Position
cahuja@andrew.cmu.edu Ahuja, Chaitanya LTI Other Student - Masters
yulund@andrew.cmu.edu Du, Yulun LTI/MLT Other Student - Masters
max@cs.cmu.edu Eskenazi, Maxine LTI/SCS PI Faculty - Research/Systems
morency@cs.cmu.edu Morency, Louis-Philippe LTI/SCS Co-PI Faculty - Research/Systems

Budget

Amount of UTC Funds Awarded
$127,500.00
Total Project Budget (from all funding sources)
$127,500.00

Documents

Type Name Uploaded
Progress Report 46_Progress_Report_2017-09-30 Sept. 26, 2017, 5:31 a.m.
Progress Report 46_Progress_Report_2018-03-31 March 20, 2018, 6:23 a.m.
Final Report UTC_project_13_Multimodal_Detection_of_Driver_Distraction_-_final_report_4qxRnW0.pdf Oct. 9, 2018, 7 a.m.
Publication Multimodal Polynomial Fusion for Detecting Driver Distraction. Dec. 8, 2020, 9:55 a.m.

Match Sources

No match sources!

Partners

No partners!