==========================
Introduction to DataFrames
==========================

When visualising data it is often important to check the data first, in which
case load it into a dataframe before plotting the results. Matplotlib is often
used, especially for more complicated plots, and does not always need
a dataframe.

.. plot:: pyplots/sine.py
    :include-source:

When the script is run from Python, the plot window includes a toolbar that
gives the user some interaction.

.. figure:: ../figures/matplotlib.avif
    :width: 640
    :height: 268
    :align: center

    Matplotlib Interactive Toolbar
    
    The cursor coordinates are not constrained to the plot.

Seaborn is built upon Matplotlib. Ensure the data is passed by keyword as
``x`` and ``y``, otherwise seaborn raises a ``TypeError``::

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_theme(style="darkgrid")

    x = [1, 2, 3, 4, 5]
    y = [1, 5, 4, 7, 4]

    sns.lineplot(x, y)
    plt.show()
    
    TypeError: lineplot() takes from 0 to 1 positional arguments but 2 were given

The corrected version passes the data by keyword:

.. plot:: pyplots/sea0.py
    :include-source:

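Altair requires the data source to be stated explicitly. Given a dataframe,
the library infers the data type of each column. Inline records wrapped in
``alt.Data`` (a list with one dictionary per row) can be used instead, but
then the type must be given in the encoding, for example ``:Q`` for
quantitative. In JupyterLab the chart is displayed automatically::

    import altair as alt

    x = [1, 2, 3, 4, 5]
    y = [1, 5, 4, 7, 4]

    # One dictionary per row, wrapped as inline data.
    source = alt.Data(values=[{'x': a, 'y': b} for a, b in zip(x, y)])

    # Types are not inferred for inline data, so state them (Q = quantitative).
    alt.Chart(source).mark_line().encode(x='x:Q', y='y:Q')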

The dataframe is optional for Plotly::

    import plotly.express as px
    
    x = [1, 2, 3, 4, 5]
    y = [1, 5, 4, 7, 4]
    
    fig = px.line(x=x, y=y)
    fig.show()

The plot opens as a web page with a toolbar offering these options:

    * Download plot as png
    * Zoom
    * Pan
    * Zoom in
    * Zoom out
    * Autoscale
    * Reset axes

Hovering over the plot also shows a popup label, coloured to match the line,
with the coordinates of the data point or marker under the cursor.

Plotly can also take the data as a dictionary of lists::

    import plotly.express as px
    
    x = [1, 2, 3, 4, 5]
    y = [1, 5, 4, 7, 4]
    
    # Named data rather than dict, to avoid shadowing the built-in type.
    data = {'x': x, 'y': y}
    print(data)
    # {'x': [1, 2, 3, 4, 5], 'y': [1, 5, 4, 7, 4]}

    fig = px.line(data, x="x", y="y")
    fig.show()

When using a dictionary, or when the plotting becomes complex, it is best to
change over to a dataframe. This applies to the other libraries as well::

    import plotly.express as px
    import pandas as pd
    
    df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                       'y': [1, 5, 4, 7, 4]})
    print(df)
    #    x  y
    # 0  1  1
    # 1  2  5
    # 2  3  4
    # 3  4  7
    # 4  5  4

    fig = px.line(df, x="x", y="y")
    fig.show()

Notice how the data has changed from an essentially horizontal view (lists or
a dictionary) to a vertical one, with **'x'** and **'y'** becoming column
names. The column of integers on the left is the index. Plotting library
examples often use this layout, known as the long format. The long format is
normally the most suitable, although some plotting methods expect wide-form
data instead.

* wide-form data
    has one row per value of the independent variable, with metadata recorded
    in the row and column labels.

* long-form data
    has one row per observation, with metadata recorded within
    the table as values.
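
The two layouts can be compared with a minimal sketch. It assumes a small
made-up table, so the column names ``day``, ``north``, ``south``, ``site`` and
``reading`` are hypothetical. The pandas ``melt`` method converts the
wide-form table into a long-form one::

    import pandas as pd

    # Wide form: one row per day, one column per site (made-up values).
    wide_df = pd.DataFrame({'day': [1, 2, 3],
                            'north': [10, 12, 11],
                            'south': [7, 9, 8]})

    # Long form: one row per observation; the site name is now a value
    # in the 'site' column instead of a column label.
    long_df = wide_df.melt(id_vars='day', var_name='site',
                           value_name='reading')
    print(long_df)

The three wide rows become six long rows, one for each day and site pair.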

When the data is supplied as lists or generated from mathematical or
scientific formulae there is little reason to expect corrupted data, but for
many applications this cannot be assumed. This is where working in a dataframe
with Pandas is often necessary.
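
As a rough sketch of such a check, assume a set of readings in which one value
is missing and another is the text ``'n/a'`` (both invented for illustration).
Pandas can coerce the column to numbers, count the gaps and drop the
incomplete rows before plotting::

    import pandas as pd

    # Made-up readings: one missing value and one non-numeric entry.
    df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                       'y': [1, 5, None, 'n/a', 4]})

    # Coerce y to numbers; anything that cannot be parsed becomes NaN.
    df['y'] = pd.to_numeric(df['y'], errors='coerce')

    # Count the missing values in each column.
    missing = df.isna().sum()
    print(missing)

    # Keep only the complete rows for plotting.
    clean = df.dropna()
    print(clean)

Whether to drop, fill or investigate the bad rows depends on the application;
dropping is only the simplest option.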

Many datasets are stored as csv files or come from a website. If the
complete file can easily be viewed on one screen then working with Pandas may
not be so necessary. Even so, it can be worth checking the data with a plot
to ensure that all is correct.
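
A first look at a csv source might be sketched as follows. The csv text here
is a made-up stand-in held in memory, used only so that the example is
self-contained::

    import io

    import pandas as pd

    # Stand-in for a csv file on disk or on a website (made-up contents).
    csv_text = "x,y\n1,1\n2,5\n3,4\n4,7\n5,4\n"

    df = pd.read_csv(io.StringIO(csv_text))

    # First look: leading rows, column types and summary statistics.
    print(df.head())
    print(df.dtypes)
    print(df.describe())

``pd.read_csv`` also accepts a file path or a URL in place of the ``StringIO``
object. Once loaded, the dataframe can be passed to any of the plotting calls
shown earlier.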