Project

#590 Enhanced Crash Risk Estimation in Urban Environments: Integrating Multi-Source Data and Advanced Modeling Approaches for the City of Pittsburgh


Principal Investigator
Sean Qian
Status
Active
Start Date
July 1, 2025
End Date
June 30, 2026
Project Type
Research Advanced
Grant Program
US DOT BIL, Safety21, 2023 - 2028 (4811)
Grant Cycle
Safety21 : 25-26
Visibility
Public

Abstract

Crash risk estimation is a critical aspect of transportation safety research, aimed at identifying high-risk locations and mitigating potential hazards for road users. Limited resources can then be allocated to those high-risk areas to improve safety effectively across the entire network. As travel demand grows, new mobility options emerge, and traffic volumes increase, the complexity of factors influencing crash risk has also grown. Historically, crash risk estimation has been approached using statistical models, such as Safety Performance Functions (SPFs), which predict the expected crash frequency or risk on road segments based on exposure and other factors. However, the dynamic nature of urban environments and the growing availability of diverse data streams, such as video analytics, near-miss reports, real-time traffic data, and social media, present opportunities to enhance traditional methods. In particular, data that directly measure near-misses, in addition to observed but rare crashes, enable a more holistic and accurate measure of crash risk. These sensing and methodological advancements can improve model accuracy, reduce biases, and provide more granular insights into high-risk locations for both motorized and non-motorized traffic.

By leveraging long-term, multi-source data related to crash risk from the City of Pittsburgh, including police reports, traffic stops, and near-miss incidents, and by applying both statistical and machine learning approaches, this project aims to enhance the precision of crash risk estimates. It addresses key challenges related to data aggregation, modeling performance, and bias mitigation to provide more reliable safety insights that can inform future interventions.

Methodology and steps

Our approach is to estimate SPFs for both motorized and non-motorized crash risks in the City of Pittsburgh using multiple data sources, including historical crash data, UPMC urgent care data, police traffic stop reports, 311 near-miss reports, 911 dispatch data, Waze reports (social media), and Velo.ai video analytics data, among others. Specifically, we process the video, medical, police, and Waze data across all locations and over the past few years to either improve or validate the SPFs (namely, the current risk maps). We apply both statistical and machine learning models and select the best model based on model performance. We develop a new crash map and compare it with the current risk map network. We explore the potential biases and how to improve the accuracy of crash risk estimation in general. This work will be conducted specifically for the City of Pittsburgh, but it can be replicated in other communities. The details of each step are listed below.
Step 1: Conduct a comprehensive literature review on risk factors and video analytics algorithms for near-miss detection
We plan to carry out a comprehensive literature review on the types of risk factors that contribute significantly to crash risk (particularly near-misses involving vulnerable road users). This literature review will help identify the datasets needed to build robust SPFs for estimating street-level crash risk in the following tasks. In addition, we will review approaches for detecting incidents from video data, which could help improve our algorithm for detecting near-miss cases in camera videos.

Step 2: Data collection and video processing for near-miss incidents
In this step, we will collect datasets from multiple sources that could benefit our safety performance function estimation. Table 1 lists potential datasets that could be applied in this research project. Furthermore, we plan to process camera videos from selected intersections with advanced deep learning techniques to detect near-miss incidents. The resulting near-miss counts and the factors driving them could be used either to improve the SPFs or to validate the estimated SPFs.

Table 1. Source and year of the datasets
Dataset	Data source	Year
High injury network	City of Pittsburgh	-
Crash data	PennDOT	2017-2023
UPMC urgent care	UPMC	In request
Police traffic stop report	City of Pittsburgh	2018-2022
311 near miss report	City of Pittsburgh	2019-2023
911 Dispatch data	WPRDC	2015-2022
Network of bike facilities	City of Pittsburgh	-
Built environment data	EPA	-
Socio-demographic data	NHGIS	-
Waze data	Waze for Cities	2018-2024
Cyclists’ waypoints and distances to vehicles	Velo.ai	2022-2024
Camera video from selected intersections	City of Pittsburgh	In request
Strava pedestrian and bicyclist count data	Strava Metro	2023-2024
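
The datasets in Table 1 would need to be brought to a common spatial unit before they can feed an SPF. The following is a minimal sketch of that aggregation step, assuming each record has already been matched to a street segment. The record layout, the source labels, and the segment IDs are hypothetical and are not taken from the project data.

```python
from collections import defaultdict

# Hypothetical event records: (source, segment_id). Real data would first need
# geocoding and map-matching to street segments, which is not shown here.
events = [
    ("crash", "seg_A"),
    ("crash", "seg_B"),
    ("311_near_miss", "seg_A"),
    ("311_near_miss", "seg_A"),
    ("waze_report", "seg_C"),
    ("traffic_stop", "seg_B"),
]

def counts_by_segment(records):
    """Return {segment_id: {source: count}} for a list of (source, segment_id)."""
    table = defaultdict(lambda: defaultdict(int))
    for source, segment_id in records:
        table[segment_id][source] += 1
    # Convert to plain dicts so that later lookups do not silently add keys.
    return {seg: dict(by_source) for seg, by_source in table.items()}

segment_counts = counts_by_segment(events)
print(segment_counts["seg_A"])  # {'crash': 1, '311_near_miss': 2}
```

Keeping the counts separate by source, rather than summing them, leaves open whether a given source is used as a model input or held back for validation.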

Step 3: Estimate SPFs with both statistical and machine learning modeling approaches
We will estimate two sets of SPFs, one for motorized crash risk and one for non-motorized crash risk, with the data collected and processed in Step 2. We plan to apply both statistical models (e.g., linear regression, Poisson regression, and Negative Binomial regression) and machine learning models (e.g., random forest, gradient boosting decision tree, and neural network). We will compare the model performance and select the best one as our final model.
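
As an illustration of the statistical side of this step, the sketch below fits a one-covariate Poisson SPF of the form ln(mu) = b0 + b1 * ln(AADT) by Newton-Raphson and compares its deviance with a constant-mean baseline. It is a simplified example under stated assumptions, not the project's model: the traffic volumes and crash counts are made-up values, AADT is assumed as the exposure measure, and a real analysis would likely rely on an established statistics library and many more covariates.

```python
import math

# Hypothetical segments: traffic volume (exposure) and observed crash counts.
aadt = [2000, 5000, 8000, 12000, 20000, 30000]
crashes = [1, 2, 4, 5, 9, 12]
log_aadt = [math.log(v) for v in aadt]

def fit_poisson_spf(x, y, max_iter=50, tol=1e-10):
    """Fit ln(mu) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0 = math.log(sum(y) / len(y))  # start from the constant-mean model
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def poisson_deviance(y, mu):
    """Poisson deviance; the y * ln(y / mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total

b0, b1 = fit_poisson_spf(log_aadt, crashes)
fitted = [math.exp(b0 + b1 * xi) for xi in log_aadt]
baseline = [sum(crashes) / len(crashes)] * len(crashes)

dev_spf = poisson_deviance(crashes, fitted)
dev_base = poisson_deviance(crashes, baseline)
print(f"exposure exponent b1 = {b1:.3f}")
print(f"deviance: SPF = {dev_spf:.3f}, constant-mean baseline = {dev_base:.3f}")
```

The deviances here are computed on the same data used for fitting, which only shows the mechanics. When comparing statistical and machine learning models, held-out data or cross-validation would give a fairer picture of performance.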

Step 4: Develop a crash map for all street segments in Pittsburgh, compare it with the original high injury network, and prioritize high-risk locations
We plan to use the estimated SPFs to predict the expected crash risk (both motorized and non-motorized) for all streets in the city of Pittsburgh and develop a crash map dashboard. We will compare the crash risk with those estimated in the high injury network and find the differences. We will explore the sources of bias and study how to improve crash risk estimation. Finally, we will rank the locations and prioritize high-risk ones for further examination.
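
The ranking and comparison described in this step can be sketched as follows. The predicted risk values, the segment IDs, and the membership of the high injury network are hypothetical placeholders, and the top-k cutoff is an assumed prioritization rule rather than one stated by the project.

```python
# Hypothetical predicted crash risk per segment (e.g., expected crashes per year).
predicted_risk = {"seg_A": 3.2, "seg_B": 0.4, "seg_C": 2.1, "seg_D": 1.7, "seg_E": 0.9}
# Hypothetical set of segments on the existing high injury network.
high_injury_network = {"seg_A", "seg_D", "seg_E"}

def top_k_segments(risk, k):
    """Return the k segment IDs with the highest predicted risk, highest first."""
    ranked = sorted(risk, key=lambda seg: risk[seg], reverse=True)
    return ranked[:k]

def compare_with_network(top_segments, network):
    """Split segments into agreement and disagreement with the existing network."""
    top = set(top_segments)
    return {
        "in_both": sorted(top & network),
        "newly_flagged": sorted(top - network),
        "no_longer_flagged": sorted(network - top),
    }

top3 = top_k_segments(predicted_risk, 3)
comparison = compare_with_network(top3, high_injury_network)
print(top3)        # ['seg_A', 'seg_C', 'seg_D']
print(comparison)
```

Segments that appear only in one of the two lists are natural starting points for examining where the estimates diverge and why.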


    
Strategic Description / RD&T
Section left blank until USDOT’s new priorities and RD&T strategic goals are available in Spring 2026.
Deployment Plan
July – September 2024
1.	 Briefs and Demos to the City of Pittsburgh, UPMC and Velo

October – December 2024
1.	 Briefs and Demos to the City of Pittsburgh, UPMC and Velo

January – March 2025
1.	Briefs and Demos to PennDOT
2.	Briefs and Demos to other areas, such as FHWA Turner-Fairbank research center

April – June 2025
1.	Briefs and Demos to the City of Pittsburgh, UPMC and Velo
2.	Develop research report.
3.	Develop a prototype dashboard web-GIS application for the City
4.	Develop policy brief for legislators.

We will work closely with the City of Pittsburgh to implement this research, and potentially extend the work to other regions as well. The team, consisting of the PI, a research scientist, and PhD students, will hold bi-weekly coordination calls with the City, Velo.ai, and UPMC doctors to discuss difficulties encountered and proposed solutions, and to outline plans for completing the scope of work, key milestones, and deliverables. When performing the tasks, we will meet with project managers, engineers, and staff at those state agencies, who will provide feedback and comments each month, to ensure that model development and testing are consistent with the City of Pittsburgh DOMI's view and that the tasks are aligned with the partners’ needs.

Upon completion of this project, we plan to actively seek both industrial and federal funding based on this initial development. Our framework is applicable to any large traffic network with safety data and multi-source data from social media. This generality should attract attention from various public agencies and non-profit organizations seeking to better deploy safe roadway infrastructure and road construction projects. Potential funding agencies and collaborators include the Department of Transportation, Federal Highway Administration, TRB, state DOTs, MPOs/RPOs, local non-profits, and mobility service companies.

Expected Outcomes/Impacts
The goal of this research is to develop a comprehensive and broadly applicable framework of safety risk models and tools for regional transportation networks, leveraging multi-source data and near-miss estimation. The framework will be accessible through open-source code available online, complemented by a prototype web application. This application will leverage multi-modal data collected over several years from the City of Pittsburgh, offering user interfaces to manage visualizations of the risks attributed to each data source. It will also enable visualization of crash risks for selected road segments, categorized by motorized versus non-motorized traffic, time of day, weekday versus non-weekday, road level, and season. A case study in Pittsburgh will assist public agencies in setting guidelines for sensing, estimating, and mitigating roadway risks. We plan to actively collaborate with local transportation agencies, including PennDOT, BikePGH, various departments in the City of Pittsburgh, and UPMC, to assess their interest in these tools and integrate them into their regular operations.
Expected Outputs
Enhanced knowledge and awareness about the effects of various data sets on estimating crash risks on roads, especially on roads with a history of fatal accidents. Improved understanding of how multi-source data can aid public agencies in achieving greater transparency and making more informed decisions regarding roadway safety operations, especially in the context of evolving technologies and the increasing availability of diverse data sources.

TRID
Crash risk estimation faces several important challenges, which we summarize as follows after a thorough review of the TRID literature.

First, crash risk estimation is a data-intensive task. As multiple factors may affect crash risk on the streets, it is important to aggregate the datasets of those influential factors in the modeling process. Traditional risk estimation methods use only observed crash data on major roads to generate risk maps (e.g., the risk maps created from PennDOT crash data), but these maps may not effectively reflect the truly risky locations or times because crashes are highly sporadic and random. This is evident from the inconsistency between the City of Pittsburgh crash risk maps and anecdotal 311 reports of residents’ risk assessments and witnessed near-miss occurrences.

Second, estimation performance depends heavily on the modeling approach. Different modeling approaches may perform differently across datasets and across types of crash risk (e.g., motorized and non-motorized).

Finally, crash risk estimation is subject to selection bias. Since crash risk usually fluctuates over time, the period of crashes selected for modeling may fall in high-crash-frequency or low-crash-frequency years, which may bias the estimation.
To address these challenges, this research project applies multiple datasets, multiple modeling approaches, and long-term historical crash data (including near-miss estimation) to estimate the crash risk of street segments in the City of Pittsburgh.
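
The period-selection issue can be illustrated with a small numeric sketch. The annual counts below are made-up values for a single hypothetical segment, chosen only to show how a short analysis window can land on unusually high or low years while a longer record averages them out.

```python
# Hypothetical annual crash counts for one segment (illustrative values only).
annual_crashes = {2017: 2, 2018: 7, 2019: 6, 2020: 1, 2021: 2, 2022: 3, 2023: 7}

def window_means(counts_by_year, window):
    """Mean annual crashes for every run of `window` consecutive years."""
    years = sorted(counts_by_year)
    means = {}
    for i in range(len(years) - window + 1):
        span = years[i:i + window]
        means[(span[0], span[-1])] = sum(counts_by_year[y] for y in span) / window
    return means

three_year = window_means(annual_crashes, 3)
long_run = sum(annual_crashes.values()) / len(annual_crashes)

for (start, end), mean in three_year.items():
    print(f"{start}-{end}: {mean:.2f} crashes per year")
print(f"all years: {long_run:.2f} crashes per year")
```

With these values, the three-year estimate ranges from 2.0 (2020-2022) to 5.0 (2017-2019) crashes per year, against a seven-year mean of 4.0, so the choice of window alone changes the apparent risk by more than a factor of two.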

Individuals Involved

Email Name Affiliation Role Position
seanqian@cmu.edu Qian, Sean CEE PI Faculty - Tenured

Budget

Amount of UTC Funds Awarded
$100000.00
Total Project Budget (from all funding sources)
$200000.00

Documents

Type Name Uploaded
Data Management Plan dmp_4o0qFOE.docx April 15, 2025, 3:12 p.m.
Project Brief 2025_crashmap_Qian.pptx April 15, 2025, 3:13 p.m.

Match Sources

No match sources!

Partners

Name Type
UPMC Deployment Partner
Velo.ai Deployment Partner