Visualising Data
----------------

Pandas works well with Matplotlib and consequently with Seaborn. Once the data is plotted, outliers can be quickly seen. Most plots are 2D, but sometimes it is important to see the relationship between more than two variables. Use colours to show string variables; categorised variables might be shown as different point types, and changing the size of the points might show another variable.

When first visualising data, start with simple plots to highlight the strong and weak relationships and to check the data. Once the data has been cleaned, the output can be selected and customised for publishing.

The following plot was made with Plotly and saved as an HTML file, which allows one to use cursors with balloon text. This is nice to have on a site such as this.

.. raw:: html
   :file: ../pyplots/beers4.html

.. figure:: ../figures/seaborn_function_overview_8_0.avif
   :width: 633
   :height: 457
   :align: center

   Seaborn methods of plotting. The first row after the headers holds the default values.

Seaborn works well when working interactively; if necessary, add mplcursors to obtain a cursor as balloon text.

.. figure:: ../figures/br_sea_hue.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code brewery_sea_hue.py

.. literalinclude:: ../pyplots/brewery_sea_hue.py

.. raw:: html

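
The idea of showing extra variables through colour, point type and point size can be sketched with ``seaborn.scatterplot``. This is only an illustration under stated assumptions: the dataframe, its column names and its values are invented here and are not the brewery data plotted on this page.

.. code-block:: python

   import matplotlib
   matplotlib.use("Agg")  # non-interactive backend; not needed in an interactive session
   import pandas as pd
   import seaborn as sns

   # Hypothetical data: names and numbers are made up for illustration only.
   beers = pd.DataFrame({
       "extract": [11.8, 12.1, 10.0, 12.4, 11.5, 13.0],
       "abv": [4.4, 5.0, 4.1, 5.2, 4.7, 5.5],
       "beer_type": ["Pils", "Pils", "Lager", "Lager", "Pils", "Lager"],
       "packaging": ["bottle", "can", "bottle", "can", "bottle", "can"],
       "calories": [41, 44, 37, 46, 42, 49],
   })

   ax = sns.scatterplot(
       data=beers,
       x="extract",
       y="abv",
       hue="beer_type",    # third variable shown as colour
       style="packaging",  # fourth variable shown as point type
       size="calories",    # fifth variable shown as point size
   )
   ax.set_title("Five variables on a 2D scatter plot")

Using all three channels at once quickly becomes hard to read, so in practice it may be better to add them one at a time.
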
|

.. note:: Plotting in Sphinx

   In order to display the actual Seaborn plots the modules are reimported for each script; when working interactively this is not required.

The first Seaborn script shows a plot similar to the Plotly one, as one would use in an interactive session. The colours of the points are probably good enough for the final plot. The Plotly colours were based on the Seaborn palette, so the customised Seaborn plot has a matching palette.

.. figure:: ../figures/br_sea_hue.avif
   :width: 641
   :height: 500
   :align: center

.. raw:: html

Show/Hide Code brewery_custom_hue.py

.. literalinclude:: ../pyplots/brewery_custom_hue.py

.. raw:: html

|

This Seaborn plot shows how to customise the hue with a dictionary and how to add an overall title, axes labels and a legend title. Mplcursors has been added to give balloon cursors. For this to work, ensure that the column **original_extract** has been converted to numeric, or else the float format used on the balloon cursor will not work. The empty values in the column **calorific_value** throw off the cursor indexing, so two of the Pils points showed up as Pilsner Urquelle on the cursor instead of Zlaty Bazant or Topvar. Drop these rows::

    df = df.dropna(subset=['calorific_value'])

.. hint:: To View the Balloon Cursors Working

   Load the script into a Python session.

.. sidebar:: Large Scatterplots with Overlapping Points

   If there are a lot of overlapping points on the scatterplot, consider using the pandas hexagonal binning plot::

       import matplotlib.pyplot as plt

       ax = df.plot.hexbin(x='x', y='y', gridsize=20)
       plt.show()

When there are several columns of data it may be useful to check for correlations across the columns. In this instance use pair plots, which are well supported in Seaborn. Each column is paired with every other column, and each pair is shown a second time with the axes swapped. Pairing every column often makes little sense, so select those columns that are related.

.. figure:: ../figures/br_list_paired.avif
   :width: 750
   :height: 750
   :align: center

.. raw:: html

Show/Hide Code brew_list_paired.py

.. literalinclude:: ../pyplots/brew_list_paired.py

.. raw:: html

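
Restricting a pair plot to related columns can be sketched with the ``vars`` argument of ``seaborn.pairplot``. The data and column names here are synthetic and hypothetical; they only stand in for a dataframe that contains some columns worth pairing and some that are not.

.. code-block:: python

   import matplotlib
   matplotlib.use("Agg")  # non-interactive backend; not needed in an interactive session
   import numpy as np
   import pandas as pd
   import seaborn as sns

   # Synthetic stand-in data (assumption): three related columns and an unrelated id.
   rng = np.random.default_rng(1)
   extract = rng.uniform(10, 14, size=40)
   beers = pd.DataFrame({
       "extract": extract,
       "abv": 0.4 * extract + rng.normal(0, 0.1, size=40),
       "density": 1000 + 4 * extract + rng.normal(0, 0.5, size=40),
       "batch_id": np.arange(40),
   })

   # Pair only the columns that are expected to be related.
   cols = ["extract", "abv", "density"]
   grid = sns.pairplot(beers, vars=cols)

The returned grid has one row and one column of axes per selected variable, so leaving out ``batch_id`` reduces the figure from 4 x 4 to 3 x 3 panels.
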
|

Change the diagonal from a histogram to a kernel density estimate (KDE).

.. figure:: ../figures/br_list_paired_kde.avif
   :width: 750
   :height: 750
   :align: center

.. raw:: html

Show/Hide Code brew_list_paired_kde.py

.. literalinclude:: ../pyplots/brew_list_paired_kde.py

.. raw:: html

|

Change the columns to **AbV%** and **AbW%**, alcohol by volume and by weight in %. These ought to lie on a straight line; check by drawing a regression line.

.. figure:: ../figures/brew_list_scatter_regression.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code brew_list_scatter_regression.py

.. literalinclude:: ../pyplots/brew_list_scatter_regression.py

.. raw:: html

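
A numerical check can complement the regression line. The sketch below fits a straight line with ``numpy.polyfit``, reports the correlation coefficient, and finds the point furthest from the fitted line. The values are invented for illustration: alcohol by weight is taken as 0.79 times alcohol by volume, which is an assumption and not a figure from the brewery data, and the helper names are hypothetical.

.. code-block:: python

   import numpy as np

   # Invented values (assumption): AbW is modelled as 0.79 * AbV.
   abv = np.array([3.8, 4.4, 4.7, 5.0, 5.2, 5.5])
   abw = 0.79 * abv

   # The same data with one deliberately wrong entry at index 2.
   abw_bad = abw.copy()
   abw_bad[2] = 2.5


   def line_fit(x, y):
       """Return slope, intercept and correlation coefficient of a straight-line fit."""
       slope, intercept = np.polyfit(x, y, deg=1)
       r = np.corrcoef(x, y)[0, 1]
       return slope, intercept, r


   def worst_point(x, y):
       """Return the index of the point with the largest residual from the fit."""
       slope, intercept = np.polyfit(x, y, deg=1)
       residuals = y - (slope * x + intercept)
       return int(np.argmax(np.abs(residuals)))


   slope, intercept, r = line_fit(abv, abw)
   suspect = worst_point(abv, abw_bad)

A correlation coefficient close to 1 supports the straight-line relationship, while a single large residual points to a row that is worth checking by hand.
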
|

Now try with the wort and beer densities.

.. figure:: ../figures/brew_list_regplot_densities.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code brew_list_regplot_densities.py

.. literalinclude:: ../pyplots/brew_list_regplot_densities.py

.. raw:: html

|

The densities show that there is some false data here; the values for Zlaty Bazant 12 need to be recalculated. Change the wort density to 1048.37 and the beer density to 1009.69. Zlaty Bazant 12 and Pilsner Urquelle 12 are similar.

.. figure:: ../figures/brew_list_regplot_densities_rev.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code brew_list_regplot_densities_rev.py

.. literalinclude:: ../pyplots/brew_list_regplot_densities_rev.py

.. raw:: html

|

If there is a strong relationship between three variables, consider a heat map. As with categorised data, the third variable is shown as a colour, but the colour is graduated to visualise the value of the third variable. Seaborn provides a good platform to plot the data: select the columns which may be related, then use the correlation function and plot the heatmap.

.. sidebar:: Using Heatmaps to show Missing Data

   On large dataframes (df) it is often useful to highlight the degree of missing values by column::

       sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')

   Complete information shows as purple and missing information as yellow.

.. figure:: ../figures/brew_list_heatmap.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code brew_list_heatmap.py

.. literalinclude:: ../pyplots/brew_list_heatmap.py

.. raw:: html

|

A positive correlation shows that the two variables move in the same direction; a negative correlation shows that they move in opposite directions. The correlation coefficient ranges from -1.0 to +1.0. When using the correlation function, both the x and y axes contain the reduced number of columns; otherwise the heat map would have used all the columns on the y axis and just the reduced number of columns on the x axis::

    dfch.corr()
                OE°P      AbV%      FP°C     TMD°C
    OE°P    1.000000  0.770224 -0.880483 -0.865756
    AbV%    0.770224  1.000000 -0.980466 -0.347714
    FP°C   -0.880483 -0.980466  1.000000  0.525171
    TMD°C  -0.865756 -0.347714  0.525171  1.000000

If the default colours are not clear enough, use a diverging colour map.

.. figure:: ../figures/brew_list_heatmap_div.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code brew_list_heatmap_div.py

.. literalinclude:: ../pyplots/brew_list_heatmap_div.py

.. raw:: html

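
Selecting columns before calling the correlation function can be sketched with pandas alone. The dataframe below is synthetic and its column names are hypothetical; it is not the brewery data, it only shows that ``DataFrame.corr`` on a subset returns a square matrix with the same reduced set of names on both axes.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Synthetic stand-in data (assumption): two columns depend on "extract".
   rng = np.random.default_rng(0)
   extract = rng.uniform(10, 14, size=50)
   df = pd.DataFrame({
       "name": [f"beer {i}" for i in range(50)],
       "extract": extract,
       "abv": 0.4 * extract + rng.normal(0, 0.1, size=50),
       "freezing_point": -0.2 * extract + rng.normal(0, 0.05, size=50),
       "bitterness": rng.uniform(15, 40, size=50),
   })

   # Keep only the columns that may be related, then correlate them.
   cols = ["extract", "abv", "freezing_point"]
   corr = df[cols].corr()

The resulting matrix can then be passed to ``sns.heatmap``; with a diverging colour map such as ``cmap='coolwarm'`` it may help to also set ``center=0`` so that positive and negative correlations get different colours.
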
|

After using a heatmap, consider using a clustermap. This highlights how similar features are grouped.

.. figure:: ../figures/brew_list_clustermap_div.avif
   :width: 500
   :height: 500
   :align: center

.. raw:: html

Show/Hide Code brew_list_clustermap_div.py

.. literalinclude:: ../pyplots/brew_list_clustermap_div.py

.. raw:: html

|

Statistical Visualisation
^^^^^^^^^^^^^^^^^^^^^^^^^

With larger dataframes some form of statistical visualisation may be required. KDE options have already been shown in principle, together with linear regression and confidence limits. This can be extended in Seaborn by using box and violin plots.

The boxplot splits the distribution of the data into four quartiles: the middle two form the box and the outer two form the whiskers. Outliers are shown as points beyond the whiskers. The box is divided at the median. The ends of the whiskers show the calculated minimum and maximum values, which by default are the most extreme points lying within 1.5 times the interquartile range of the box. This reading works best when the data lies in a roughly bell-shaped distribution.

.. figure:: ../figures/tips_box.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code tips_box.py

.. literalinclude:: ../pyplots/tips_box.py

.. raw:: html

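
The numbers behind a boxplot can be reproduced by hand. The sketch below uses NumPy and assumes the common default rule, in which whiskers reach to the most extreme data points within 1.5 times the interquartile range (IQR) of the box. The function name ``box_stats`` and the sample values are hypothetical.

.. code-block:: python

   import numpy as np


   def box_stats(values, whis=1.5):
       """Return the quartiles, whisker ends and outliers used to draw a boxplot."""
       data = np.asarray(values, dtype=float)
       q1, median, q3 = np.percentile(data, [25, 50, 75])
       iqr = q3 - q1
       # Points outside these fences are drawn individually as outliers.
       lower_fence = q1 - whis * iqr
       upper_fence = q3 + whis * iqr
       inside = (data >= lower_fence) & (data <= upper_fence)
       return {
           "q1": q1,
           "median": median,
           "q3": q3,
           "whisker_low": data[inside].min(),
           "whisker_high": data[inside].max(),
           "outliers": data[~inside].tolist(),
       }


   # Invented sample with one obviously extreme value.
   stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

For this sample the box runs from 3.25 to 7.75 with the median at 5.5, the whiskers end at 1 and 9, and the value 50 is reported as an outlier.
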
|

The violin plot shows the distribution of quantitative data across several levels of one (or more) categorical variables in order to compare distributions. The plot is built on KDE distributions, so ensure that there is enough data; otherwise the plots are artificially smoothed.

.. figure:: ../figures/tips_violin.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code tips_violin.py

.. literalinclude:: ../pyplots/tips_violin.py

.. raw:: html

|

A nice feature of violin plots is that two categorical datatypes can be compared directly.

A strip plot can be used on its own or in combination with a box or violin plot.

.. figure:: ../figures/tips_box_strip.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code tips_box_strip.py

.. literalinclude:: ../pyplots/tips_box_strip.py

.. raw:: html

|

When using the violin plot, the bar inside the violin might be masked by the strip plot points; move the points behind the bar by adding **zorder=1** to the strip plot call.

.. figure:: ../figures/tips_violin_strip.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code tips_violin_strip.py

.. literalinclude:: ../pyplots/tips_violin_strip.py

.. raw:: html

|

To make a plot with multiple facets, it is safer to use ``catplot()`` than to work with ``FacetGrid`` directly, because ``catplot()`` will ensure that the categorical and hue variables are properly synchronised in each facet.

.. figure:: ../figures/tips_catplot.avif
   :width: 1105
   :height: 500
   :align: center

.. raw:: html

Show/Hide Code tips_catplot.py

.. literalinclude:: ../pyplots/tips_catplot.py

.. raw:: html

|

A swarm plot is similar to the strip plot, but the points are automatically adjusted so that they do not overlap, so jitter is no longer required.

.. figure:: ../figures/tips_swarmplot.avif
   :width: 640
   :height: 480
   :align: center

.. raw:: html

Show/Hide Code tips_swarmplot.py

.. literalinclude:: ../pyplots/tips_swarmplot.py

.. raw:: html

|

Storage Format
--------------

Select the storage format that best suits your requirements. Both csv and pickle formats are easily loaded and saved from Python scripts. For smaller dataframes csv is acceptable, but consider different formats for larger dataframes. As always, test thoroughly before changing formats.
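
One way to test a format before adopting it is a round trip: save the dataframe, load it again and compare. The sketch below does this for csv and pickle with a small invented dataframe; the file names and column names are hypothetical and the files are written to a temporary directory.

.. code-block:: python

   import tempfile
   from pathlib import Path

   import pandas as pd
   from pandas.api.types import is_datetime64_any_dtype

   # Invented sample data with a date column, which csv stores as plain text.
   df = pd.DataFrame({
       "name": ["A", "B", "C"],
       "abv": [4.4, 5.0, 4.1],
       "brewed": pd.to_datetime(["2021-01-05", "2021-02-11", "2021-03-20"]),
   })

   with tempfile.TemporaryDirectory() as tmp:
       csv_path = Path(tmp) / "beers.csv"
       pkl_path = Path(tmp) / "beers.pkl"

       df.to_csv(csv_path, index=False)
       df.to_pickle(pkl_path)

       # csv needs to be told which columns hold dates; pickle keeps the dtypes.
       from_csv = pd.read_csv(csv_path)
       from_csv_dates = pd.read_csv(csv_path, parse_dates=["brewed"])
       from_pickle = pd.read_pickle(pkl_path)

   csv_keeps_dates = is_datetime64_any_dtype(from_csv["brewed"])
   csv_parsed_dates = is_datetime64_any_dtype(from_csv_dates["brewed"])
   pickle_matches = from_pickle.equals(df)

In this sketch the pickle round trip returns an identical dataframe, while the csv round trip only restores the date column when ``parse_dates`` is given. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.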