Properties of the elements: data collection and ggplot2 periodic table
periodic table, ggplot2, web scraping, R-bloggers
Nov 16, 2014
We will use rvest and magrittr to scrape elemental properties from periodictable.com, and then plot the data in the form of the periodic table using ggplot2.
It is very common to visualise periodic trends in the elemental properties by using the well-established IUPAC periodic table as a canvas. In this post we will demonstrate how common R tools can make the job quick and nearly effortless, once the data is at hand.
Collecting elemental data
The website periodictable.com lists a large number of properties for each element, and the data is displayed as not overly complicated HTML tables. The website states that it was created with Mathematica (by Wolfram Research), but even so, the quality of the data on the website is not too good. It appears to have suffered from whatever conversion was applied from the original Mathematica format.
In any case, in this post the quality of the data is secondary to the goal of achieving a working proof of concept. Plus, our own scraping will inevitably degrade data quality even more.
Starting from this URL, we crawl the page and collect the URLs to all property pages, along with the name of each property. We put them together in a dataframe, so it is clear which URL belongs to which elemental property.
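The link-collection step can be sketched roughly as follows. This is only an illustration under stated assumptions: the inline HTML and the paths in it are made-up stand-ins for the real index page (whose markup may differ), and the name `properties` is hypothetical. Only `read_html()`, `html_nodes()`, `html_text()` and `html_attr()` are actual rvest functions.

```r
library(rvest)

# Hypothetical stand-in for the index page; the real markup and paths on
# periodictable.com may well be different.
index_html <- '<html><body>
  <a href="Properties/A/Density.html">Density</a>
  <a href="Properties/A/MeltingPoint.html">Melting Point</a>
  <a href="Properties/A/Density.html">Density</a>
</body></html>'

page  <- read_html(index_html)   # for the live site, pass the URL instead
links <- html_nodes(page, "a")   # every anchor element on the page

# One row per link: the property name and the URL it points to
properties <- data.frame(
  name = html_text(links),
  url  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Drop links that were collected more than once
properties <- properties[!duplicated(properties$url), ]
properties
```

Keeping the name and the URL side by side in one dataframe means the later scraping loop can iterate over rows without any separate bookkeeping.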
All elemental properties
The list of elemental properties available from periodictable.com. The list is shown as collected, so it may contain some duplicates; these will be removed later.
Collect the values for each property
Each property, e.g., density, will have a value for each element of the periodic table. This value is just a string, and depending on the type of the property, it may be just a number, or a quantity with a unit, or some text with various attributes.
At this stage, we don’t mind the internal structure of the value, we just want to collect it. We will collect all property values into one dataframe.
Looking slightly ahead, you will realise that the only way to allow different types (character, numeric, etc.) in different columns is if each property is mapped onto one column. Thus, we will build the dataframe with properties as columns, and elements as rows.
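A minimal sketch of such a skeleton, assuming three elements and three properties chosen purely for illustration (the property names and all values below are dummies, not data from the website):

```r
# Hypothetical element symbols and property names, for illustration only
symbols    <- c("H", "He", "Li")
prop_names <- c("Density", "MeltingPoint", "Color")

# One row per element, one column per property, every cell an NA string
elements_raw <- as.data.frame(
  matrix(NA_character_,
         nrow = length(symbols), ncol = length(prop_names),
         dimnames = list(symbols, prop_names)),
  stringsAsFactors = FALSE
)

# Because each property is its own column, columns can be given
# different types independently of each other (dummy values below)
elements_raw$Color   <- c("a", "b", "c")
elements_raw$Density <- as.numeric(c("1.1", "2.2", "3.3"))
str(elements_raw)
```

With elements as columns instead, every column would have to hold text and numbers at once, and everything would be coerced to character.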
Some property pages (listed below) are difficult to parse (due to the way the data is presented on the webpage). For now, we just skip those pages (no big loss).
Ok, now we have the skeleton for a dataframe. Let’s populate it.
Elemental property data collected! Next, we need to sanitize the data. As we mentioned earlier, the values may be of several different types. The biggest job is to deal with quantities and units.
There might well be an easier way to do this, but to make the coding easy, we will create two empty dataframes (based on the existing elements_raw, with identical dimensions and column names), and place quantities (and unit-less values) into one, and units into the other.
The tricky part will be to correctly determine what part of the values string is a quantity and what part is a unit. We will have to resort to regular expressions. Because of that, the following code is probably the most likely to break if anything should change at the data source.
Clean up all the value strings
We will use regular expressions to identify the different parts of each value string. See comments in the code below. Unfortunately, we found no way to clean up the value strings of each type of column (non-numeric, numbers-only, and quantity+unit) without using explicit, static assignments, as seen in the following three chunks. It is ugly, but it works, as long as no properties (i.e., columns) are added, removed, or renamed at the source.
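To give an idea of the kind of regular expression involved, here is a small sketch in base R. It is not the code used for the full dataset: the value strings are made up, the pattern is just one plausible choice, and it works on a single vector rather than on whole columns.

```r
# Made-up value strings in the three shapes described above:
# quantity + unit, number only, and free text (plus a missing value)
values <- c("8.96 g/cm3", "1.2E3 kJ/mol", "29", "Silver", NA)

# Optional sign, digits with optional decimals, optional exponent,
# then optional whitespace and whatever remains (the unit)
pattern <- "^\\s*([-+]?[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\\s*(.*)$"

# TRUE where the string starts with something that parses as a number
is_numeric_like <- grepl(pattern, values, perl = TRUE)

quantity <- ifelse(is_numeric_like,
                   sub(pattern, "\\1", values, perl = TRUE), NA)
unit     <- ifelse(is_numeric_like,
                   sub(pattern, "\\2", values, perl = TRUE), NA)
unit[!is.na(unit) & unit == ""] <- NA  # number-only values carry no unit

data.frame(value = values, quantity = as.numeric(quantity), unit = unit)
```

Note that a text value that happens to begin with a digit would be misread as a quantity by this pattern, which is one reason why per-column handling remains necessary.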
Cleaning up the units
Now that we have separated the units and the quantities into separate strings (dataframes, actually), let’s have a look at the units, to see if there’s any fixing needed.
As you can see, some properties (for example HalfLife) use more than one unit. This is problematic, since we will only be plotting against one y-axis, not several. So we will have to convert all such occurrences to their standard units, which means we have to take the numerical conversion of the quantity into consideration as well. Let’s do it.
First, let’s see all units in the set.
Some are equivalent to each other, and others can be reduced to base SI units, according to the following list, which we put together manually after inspecting the output above.
Someone should probably tell periodictable.com that they have a typo in some of their values with units of KJ/mol instead of kJ/mol.
We now build up a match-and-replace dataframe to convert non-standard or simplifiable units to standard units. This will simplify any visualisations, as well as any comparisons between properties.
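The idea of a match-and-replace table can be sketched like this. The half-life values and the time units are invented for the example, and the names `quantities`, `units` and `conversions` are hypothetical; only the KJ/mol to kJ/mol rule comes from the list above.

```r
# Made-up half-life values in mixed units (not taken from the website)
quantities <- data.frame(HalfLife = c(2, 30, 1.5, NA))
units      <- data.frame(HalfLife = c("min", "s", "h", NA),
                         stringsAsFactors = FALSE)

# All distinct units currently present in the set
sort(unique(unlist(units)))

# Match-and-replace table: the unit to look for, the standard unit to
# use instead, and the factor that converts the quantity accordingly
conversions <- data.frame(
  from   = c("min", "h", "KJ/mol"),
  to     = c("s", "s", "kJ/mol"),
  factor = c(60, 3600, 1),
  stringsAsFactors = FALSE
)

for (col in names(units)) {
  hit <- match(units[[col]], conversions$from)  # NA where no rule applies
  ok  <- !is.na(hit)
  # Scale the quantity first, then relabel the unit
  quantities[[col]][ok] <- quantities[[col]][ok] * conversions$factor[hit[ok]]
  units[[col]][ok]      <- conversions$to[hit[ok]]
}
```

Keeping the factor next to the unit pair ensures that a unit is never renamed without its quantity being rescaled at the same time.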
Visualising elemental data
First, an abridged look at the data itself.
Next, let’s try a typical plot of a property against atomic number.
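As a rough sketch of this kind of plot, the following uses a small hand-typed dataframe (the first ten elements with their Pauling electronegativities) rather than the scraped dataset; the dataframe and column names are hypothetical.

```r
library(ggplot2)

# Hand-typed stand-in for the scraped data: first ten elements,
# Pauling electronegativity (not defined for He and Ne)
elements <- data.frame(
  AtomicNumber      = 1:10,
  Symbol            = c("H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"),
  Electronegativity = c(2.20, NA, 0.98, 1.57, 2.04, 2.55, 3.04, 3.44, 3.98, NA)
)

p <- ggplot(elements, aes(x = AtomicNumber, y = Electronegativity)) +
  geom_line(colour = "grey60", na.rm = TRUE) +   # gaps where data is missing
  geom_point(na.rm = TRUE) +
  geom_text(aes(label = Symbol), vjust = -1, size = 3, na.rm = TRUE) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Atomic number", y = "Electronegativity (Pauling scale)")
p
```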
Another plot of a property (density, this time) against atomic number.
Ok, it looks as one would expect. Not very exciting, though.
We are now ready to attempt to overlay elemental properties on top of the periodic table. I think this is a worthwhile enterprise, because I have yet to see a way to programmatically create a periodic table and overlay data visually on it.
The closest thing I have come across is a periodic table drawn with TikZ, which I found both brilliant and frustrating. Brilliant, because for the first time it gave writers (well, at least writers familiar with LaTeX and TikZ) the ability to easily create and customise their own periodic tables. Frustrating, because there was no easy way to tie the generation of the TikZ-based periodic table into available periodic trend data. Of course, this was an inherent shortcoming of TikZ/LaTeX more than anything else.
In this work, I have used R with the ggplot2 plotting system to achieve this: the visualisation of arbitrary elemental data over the canvas of the periodic table.
Final preparations before plotting
The periodic table places the elements next to each other, organised in rows (periods) and columns (groups) on a two-dimensional plot. Obviously, each element’s position on this plot is completely specified by its group and period.
We will therefore introduce new group and new period variables, specifically for the purpose of plotting. We will not use the original group and period data for two reasons: some elements were not assigned a group and period in the original dataset, and we want some flexibility to adjust the coordinates to control the aesthetics of the plot.
With just a few lines, we can make a very quick-and-dirty periodic table (sans data) using ggplot2:
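A sketch of what such a quick-and-dirty table might look like, limited to the first three periods to keep it short. The column names `Graph.Period` and `Graph.Group` are hypothetical names for the plotting coordinates, and `coord_fixed()` is used here as a simple way to get square boxes, which is not necessarily how the figures in this post were sized.

```r
library(ggplot2)

# Plotting coordinates for the first 18 elements (periods 1 to 3)
pt <- data.frame(
  Symbol = c("H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
             "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar"),
  Graph.Period = c(1, 1, rep(2, 8), rep(3, 8)),
  Graph.Group  = c(1, 18, 1, 2, 13:18, 1, 2, 13:18)
)

p <- ggplot(pt, aes(x = Graph.Group, y = Graph.Period)) +
  geom_tile(fill = "white", colour = "black", width = 0.95, height = 0.95) +
  geom_text(aes(label = Symbol)) +
  scale_y_reverse() +   # period 1 at the top, as in a printed table
  coord_fixed() +       # one unit on x equals one unit on y: square boxes
  theme_void()          # no axes, grid or background
p
```

Because the coordinates are ordinary numeric columns, nudging a block of elements (the lanthanides, say) is just a matter of adding an offset to `Graph.Period` or `Graph.Group` for those rows.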
Plotting continuous elemental properties
Let’s use the periodic table we created with ggplot2 to plot some of the continuous variables.
A short explanation is in order. The boxes are inherently square, but we adjusted the plot dimensions (fig.width and fig.height in knitr parlance) to make sure the final appearance of the boxes is indeed square. Colour is mapped to the numeric values of the plotted property, using the built-in ggplot2::scale_colour_continuous() function. And we positioned the required legend over the transition metal block, to conform with most other periodic tables out there. Grayed-out elements lack data for the shown property.
Here are some more periodic tables overlaid with other continuous elemental data.
We could easily change the print size of these plots, as well as export them to most common image formats. We could also easily switch from the current knitr and markdown document system to knitr and LaTeX to take advantage of the excellent math and symbol support of LaTeX.
Plotting discrete elemental properties
Some more with discrete variables:
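For a discrete property the only real change is the scale. The sketch below maps the phase at room temperature onto the fill colour for the first 18 elements; the data is hand-typed, and the column names and colours are arbitrary choices made for this example.

```r
library(ggplot2)

# First 18 elements with plotting coordinates and a discrete property
pt <- data.frame(
  Symbol = c("H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
             "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar"),
  Graph.Period = c(1, 1, rep(2, 8), rep(3, 8)),
  Graph.Group  = c(1, 18, 1, 2, 13:18, 1, 2, 13:18),
  Phase = c("Gas", "Gas", "Solid", "Solid", "Solid", "Solid",
            "Gas", "Gas", "Gas", "Gas",
            "Solid", "Solid", "Solid", "Solid", "Solid", "Solid",
            "Gas", "Gas")
)

p <- ggplot(pt, aes(x = Graph.Group, y = Graph.Period, fill = Phase)) +
  geom_tile(colour = "black", width = 0.95, height = 0.95) +
  geom_text(aes(label = Symbol)) +
  scale_y_reverse() +
  # One fixed colour per category instead of a continuous gradient
  scale_fill_manual(values = c(Gas = "lightblue", Solid = "grey80")) +
  coord_fixed() +
  theme_void() +
  theme(legend.position = "bottom")
p
```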
We have successfully demonstrated how ggplot2 may be used to programmatically generate plots of elemental properties in the form of the typical periodic table.
To do this, we also had to scrape and collect a database of elemental property data from public webpages.
Our hope is that this will make it easier for chemists and others to generate periodic tables of whatever trend they wish to visualise.
To make it easier to repeat this code, we have included an MWE below.
Minimal working example
Use this ggplot2 code to generate a periodic table visualising a continuous variable.
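What follows is a self-contained sketch rather than the exact code behind the figures above. It covers only the first 18 elements, uses hand-typed Pauling electronegativities in place of the scraped data, and maps the value onto the tile fill; the column names `Graph.Period` and `Graph.Group` are hypothetical.

```r
library(ggplot2)

# First 18 elements: plotting coordinates plus one continuous property
# (Pauling electronegativity; NA for the noble gases)
pt <- data.frame(
  Symbol = c("H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
             "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar"),
  Graph.Period = c(1, 1, rep(2, 8), rep(3, 8)),
  Graph.Group  = c(1, 18, 1, 2, 13:18, 1, 2, 13:18),
  Electronegativity = c(2.20, NA, 0.98, 1.57, 2.04, 2.55, 3.04, 3.44, 3.98, NA,
                        0.93, 1.31, 1.61, 1.90, 2.19, 2.58, 3.16, NA)
)

p <- ggplot(pt, aes(x = Graph.Group, y = Graph.Period,
                    fill = Electronegativity)) +
  geom_tile(colour = "black", width = 0.95, height = 0.95) +
  geom_text(aes(label = Symbol)) +
  scale_y_reverse() +                             # period 1 at the top
  scale_fill_continuous(na.value = "grey80") +    # missing data greyed out
  coord_fixed() +                                 # square boxes
  theme_void() +
  # Legend placed in the empty area above the transition metal block
  # (recent ggplot2 versions may warn and suggest legend.position.inside)
  theme(legend.position = c(0.5, 0.8), legend.direction = "horizontal")
p
```

To plot a different property, replace the `Electronegativity` column and the `fill` mapping; to cover the whole table, extend the dataframe with the remaining elements and their coordinates.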