Loading Data#

Long and Wide Form#

../_images/long_form_data_structure_19_0.avif

Left hand image long-form, right hand image wide-form#

From the Seaborn documents.

Data should be in long format, this means that every column refers to a variable, and every row is the observed data. This follows the format of most databases.

year

month

passengers

0

1949

Jan

112

1

1949

Feb

118

2

1949

Mar

132

3

1949

Apr

129

4

1949

May

121

See how data is repeated within the year column, the month column repeats for each year.

Not all data arrives in this format, often it is in wide format. This means that the first row and first column are both variables and each cell is data. When there is a large number of variables then the long form is superior, but the wide form is probably easier to visualise as a table with two or three variables.

month

Jan

Feb

Mar

Apr

May

Jun

Jul

Aug

Sep

Oct

Nov

Dec

year

1949

112

118

132

129

121

135

148

148

136

119

104

118

1950

115

126

141

135

125

149

170

170

158

133

114

140

1951

145

150

178

163

172

178

199

199

184

162

146

166

1952

171

180

193

181

183

218

230

242

209

191

172

194

1953

196

196

236

235

229

243

264

272

237

211

180

201

None of the year or month data is repeated, but then the name passengers has been lost.

If data has to be converted to long format it might only need a simple transformation such as a table pivot or transpose. More complex changes may need the Pandas method pandas.DataFrame.melt().

For a more comprehensive discussion see Long-form vs. wide-form data

Data Loading#

Data can be loaded directly by typing it in, which probably applies to practice sessions, or more likely directly from a saved file. The structure is similar to treeview, there are columns with a header and rows of data, which can be expanded down the page. As a practical limit keep the number of columns to the screen width, although many tutorials have much larger sizes:

import pandas as pd

# create a dictionary
data = {'Product': ['Tablet','Printer','Laptop','Monitor'],
    'Price': [250,100,1200,300]
    }

# load directly into dataframe
df = pd.DataFrame(data)
# the dictionary keys become columns
print(df)

   Product  Price
0   Tablet    250
1  Printer    100
2   Laptop   1200
3  Monitor    300

Note that a dictionary is usually a wide format although with only two keys and columns it could be mistaken for a long format, see also melt.

The dataframe easily produces a list:

products_list = df.values.tolist()
print(products_list)

[['Tablet', 250], ['Printer', 100], ['Laptop', 1200], ['Monitor', 300]]

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once:

df1 = pd.DataFrame({'Product': ['Tablet','Printer','Laptop'],
    'Price': [250,100,1200]})

df2 = pd.DataFrame({'Product': ['Monitor'], 'Price': [300]})

# append is deprecated
# df3 = df1.append(df2, ignore_index=True)
df3 = pd.concat([df1, df2], ignore_index=True)
df3
   Product  Price
0   Tablet    250
1  Printer    100
2   Laptop   1200
3  Monitor    300

When concating use the same columns names, ignore_index=True is required or else the index no longer has unique values, alternatively load a list:

l1 = ['Mouse', 10]

# locate the end of the dataframe
# df3 is overwritten
df3.loc[len(df3)] = l1
df3
   Product  Price
0   Tablet    250
1  Printer    100
2   Laptop   1200
3  Monitor    300
4    Mouse     10

Load csv data directly from disk or from a known URL source:

df = pd.read_csv("file_path/beer_list.csv")

url = 'https://raw.githubusercontent.com/Vibe1990/Netflix-Project/main/netflix_title.csv'
df = pd.read_csv(url)

Other formats are accessible although additional dependancies might be necessary

  • Jason

    read_json()

  • HTML

    read_html()

  • XML

    read_xml()

  • SQL

    read_sql()

  • Excel

    read_excel()