Enabling Analysis of Disparate Safety Data Sets: The UL Safety Data Lake
The UL Data Science team works to simplify research and analysis of safety information
In July 2018, the Underwriters Laboratories Data Science team introduced a new portal that consolidates open data from multiple safety-focused databases into a single environment. Named the UL Safety Data Lake, the platform gives UL colleagues who are not data specialists access to multiple safety data sources, curated for content and quality, to support their analysis and research.
Amalgamating data from disparate sources
The Data Lake contains information from the U.S. Consumer Product Safety Commission (CPSC) SaferProducts.gov website, the National Electronic Injury Surveillance System (NEISS), the Injury/Potential Injury Incident File (IPII), and the Death Certificate file. Additional data sets within the Data Lake include recall data from the EU’s Safety Gate, the FDA’s medical device incident database (Manufacturer and User Facility Device Experience, or MAUDE), and Pipeline and Hazardous Materials Safety Administration (PHMSA) incident data. The PHMSA data is part of ongoing UL research into lithium-ion battery incidents on aircraft.
The site provides data visualizations and summary statistics for each of the data files, with the ability to search and query the data in straightforward ways (keyword search, for example). UL colleagues can also use an Application Programming Interface (API) to bring the data into Excel spreadsheets and Power BI for further analysis. The tool was launched as a resource for UL employees but could be expanded for public accessibility.
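As a rough illustration of what keyword search over intermingled records can look like, the sketch below scans every text field of each record, whatever the record's schema. It is not the UL Safety Data Lake's implementation or API: the `RECORDS` list, its field names and its narratives are invented for this example, and `keyword_search` is a hypothetical helper.

```python
# Hypothetical records: the field names and text are invented for illustration
# and do not reflect the actual schemas of the underlying data sets.
RECORDS = [
    {"source": "NEISS", "narrative": "Patient burned hand on space heater"},
    {"source": "SaferProducts.gov",
     "incident_description": "Laptop battery overheated and smoked"},
    {"source": "PHMSA",
     "summary": "Lithium-ion battery in checked bag emitted smoke"},
]


def keyword_search(records, keyword):
    """Return every record in which any string field contains the keyword.

    The match is case-insensitive and does not assume a shared schema,
    so records from different sources can be searched together.
    """
    needle = keyword.casefold()
    hits = []
    for record in records:
        for value in record.values():
            if isinstance(value, str) and needle in value.casefold():
                hits.append(record)
                break  # one matching field is enough for this record
    return hits


if __name__ == "__main__":
    for hit in keyword_search(RECORDS, "battery"):
        print(hit["source"])
```

Because the function looks at every string value instead of a fixed column, a single query reaches records whose narrative text lives under different field names.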
If the phrase “data lake” is new to you, you’re not alone. It’s an emerging concept within the data science community. Unlike a traditional data warehouse, which is highly structured, a data lake imposes minimal structure, so data from different sources can be intermingled and tools such as machine learning algorithms can work on the whole set rather than on structured subsets. Ultimately, it is more fluid, flexible and user-driven.
Where this work will take us
Future capabilities envisioned for the UL Safety Data Lake include natural language processing of incident narratives based on user-defined subjects, comprehensive search across all data sets simultaneously, and recommendation engines that sort through data to find incidents similar to the ones currently being viewed. The team will also add further CPSC data (violations, fines, etc.) and plans to extend the concept to other sources of open data, such as the National Fire Incident Reporting System (NFIRS), the European Injury Data Base and more.
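To make the idea of finding incidents of a similar nature concrete, here is a minimal sketch that ranks narratives by word overlap (Jaccard similarity) with the incident being viewed. This is one simple technique chosen for illustration, not the approach the team has announced; the function names and sample narratives are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query, narratives, top_n=3):
    """Return up to top_n (score, narrative) pairs, highest score first."""
    query_tokens = tokenize(query)
    scored = [(jaccard(query_tokens, tokenize(n)), n) for n in narratives]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Invented narratives, used only to exercise the functions above.
    sample = [
        "battery overheated and caught fire",
        "child fell from high chair",
        "phone battery caught fire while charging",
    ]
    for score, narrative in most_similar("laptop battery caught fire", sample):
        print(f"{score:.2f}  {narrative}")
```

Word overlap ignores synonyms and word order, so a production recommendation engine would likely need richer text representations; the sketch only shows the ranking step.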
Although the UL Safety Data Lake is not currently available as a public resource, the Data Science team is exploring the possibility. In addition, the team is investigating other potentially public-facing platforms that would make safety data easier to use for comprehensive research and analysis.
Interested in learning more about the UL Safety Data Lake and the Data Science team? Contact us.
- A data lake intermingles multiple sources of data to allow for simultaneous analysis of the entire set.
- The UL Safety Data Lake comprises data from seven unique sources, with additional data sources and capabilities to be added over time.