spend less time munging, and more time analyzing
Scientists spend a lot of time “munging” data. Finding, cleaning, and managing datasets can take up the majority of the time it takes to complete an analysis. Tools that make the munging process easier can save scientists a lot of time.
We are tackling a small part of this problem in the context of the CDC’s National Health and Nutrition Examination Survey, a national study conducted every two years that gathers data on American’s health, nutrition, and exposure to environmental chemicals. Data from NHANES is used widely by environmental health scientists to study the U.S. public’s exposure to chemicals.
Figuring out how to set up a proper analysis of NHANES data can be tricky if you’re starting from scratch. That’s why we wrote RNHANES, an R package that makes it easy to download, analyze, and visualize NHANES data. RNHANES streamlines accessing the data, helps to set up analyses that correctly incorporate the study’s sample weights, and has built in plotting functions.
Let’s take the package for a spin by conducting a real-world analysis of NHANES data. Specifically, let’s characterize the U.S. population’s exposure to PFOA, a perfluo-alkyl substance (PFAS, also called “highly fluorinated compound”) that is linked to a number of adverse health effects.
exploration of maximum-likelihood estimation techniques
Data are censored when samples from one or both ends of a continuous distribution are cut off and assigned one value. Environmental chemical exposure datasets are often left-censored: instruments can detect levels of chemicals down to a limit, underneath which you can’t say for sure how much of the chemical was present. The chemical may not have been present at all, or was present at a low enough level that the instrument couldn’t pick it up.
There are a number of methods of dealing with censored data in statistics. The crudest approach is to substitute a value for every censored value. This may be zero, the detection limit, the detection limit over the square root of two, or some other value. This method, however, biases the statistical results depending on what you choose to substitute.
A more nuanced alternative is to assume the data follows a particular distribution. In the case of environmental data, it often makes sense to assume a log-normal distribution. When we do this, while we may not know what the precise levels of every non-detect, we can at least assume that they follow a log-normal distribution.
In this article, we will walk through fitting a log-normal distribution to a censored dataset using the method of maximum likelihood.
axis limits remove data by default
Setting axis limits in ggplot has behaviour that may be unexpected: any data that falls outside of the limits is ignored, instead of just being hidden. This means that if you apply a statistic or calculation on the data, like plotting a box and whiskers plot, the result will only be based on the data within the limits.
In other words, ggplot doesn’t “zoom in” on a part of your plot when you apply an axis limit, it recalculates a new plot with the restricted data.read more
sdftosvg is an NPM package for rendering chemical structures
If you look at the structures of chemicals within a family, you can see striking similarities. Take PCBs, which are industrial chemicals that were banned in the 1970s, as an example:
We wanted a way to generate beautiful chemical structure visualizations, like the ones you see above, for use in our materials. We developed a lightweight tool called
sdftosvg that does just that: it takes as input SDF files, downloadable from repositories like PubChem, and outputs SVG renderings.