Loading Data#

Long and Wide Form#

../_images/long_form_data_structure_19_0.avif — Left hand image long-form, right hand image wide-form#

From the Seaborn documents.

Data should be in long format, this means that every column refers to a variable, and every row is the observed data. This follows the format of most databases.

year

month

passengers

0

1949

Jan

112

1

1949

Feb

118

2

1949

Mar

132

3

1949

Apr

129

4

1949

May

121

	year	month	passengers
0	1949	Jan	112
1	1949	Feb	118
2	1949	Mar	132
3	1949	Apr	129
4	1949	May	121

See how data is repeated within the year column, the month column repeats for each year.

Not all data arrives in this format, often it is in wide format. This means that the first row and first column are both variables and each cell is data. When there is a large number of variables then the long form is superior, but the wide form is probably easier to visualise as a table with two or three variables.

month

Jan

Feb

Mar

Apr

May

Jun

Jul

Aug

Sep

Oct

Nov

Dec

year

1949

112

118

132

129

121

135

148

148

136

119

104

118

1950

115

126

141

135

125

149

170

170

158

133

114

140

1951

145

150

178

163

172

178

199

199

184

162

146

166

1952

171

180

193

181

183

218

230

242

209

191

172

194

1953

196

196

236

235

229

243

264

272

237

211

180

201

month	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
year
1949	112	118	132	129	121	135	148	148	136	119	104	118
1950	115	126	141	135	125	149	170	170	158	133	114	140
1951	145	150	178	163	172	178	199	199	184	162	146	166
1952	171	180	193	181	183	218	230	242	209	191	172	194
1953	196	196	236	235	229	243	264	272	237	211	180	201

None of the year or month data is repeated, but then the name passengers has been lost.

If data has to be converted to long format it might only need a simple transformation such as a table pivot or transpose. More complex changes may need the Pandas method pandas.DataFrame.melt().

For a more comprehensive discussion see Long-form vs. wide-form data

Data Loading#

Data can be loaded directly by typing it in, which probably applies to practice sessions, or more likely directly from a saved file. The structure is similar to treeview, there are columns with a header and rows of data, which can be expanded down the page. As a practical limit keep the number of columns to the screen width, although many tutorials have much larger sizes:

import pandas as pd

# create a dictionary
data = {'Product': ['Tablet','Printer','Laptop','Monitor'],
    'Price': [250,100,1200,300]
    }

# load directly into dataframe
df = pd.DataFrame(data)
# the dictionary keys become columns
print(df)

   Product  Price
0   Tablet    250
1  Printer    100
2   Laptop   1200
3  Monitor    300

Note that a dictionary is usually a wide format although with only two keys and columns it could be mistaken for a long format, see also melt.

The dataframe easily produces a list:

products_list = df.values.tolist()
print(products_list)

[['Tablet', 250], ['Printer', 100], ['Laptop', 1200], ['Monitor', 300]]

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once:

df1 = pd.DataFrame({'Product': ['Tablet','Printer','Laptop'],
    'Price': [250,100,1200]})

df2 = pd.DataFrame({'Product': ['Monitor'], 'Price': [300]})

# append is deprecated
# df3 = df1.append(df2, ignore_index=True)
df3 = pd.concat([df1, df2], ignore_index=True)
df3
   Product  Price
0   Tablet    250
1  Printer    100
2   Laptop   1200
3  Monitor    300

When concating use the same columns names, ignore_index=True is required or else the index no longer has unique values, alternatively load a list:

l1 = ['Mouse', 10]

# locate the end of the dataframe
# df3 is overwritten
df3.loc[len(df3)] = l1
df3
   Product  Price
0   Tablet    250
1  Printer    100
2   Laptop   1200
3  Monitor    300
4    Mouse     10

Load csv data directly from disk or from a known URL source:

df = pd.read_csv("file_path/beer_list.csv")

url = 'https://raw.githubusercontent.com/Vibe1990/Netflix-Project/main/netflix_title.csv'
df = pd.read_csv(url)

Other formats are accessible although additional dependancies might be necessary

Jason
read_json()
HTML
read_html()
XML
read_xml()
SQL
read_sql()
Excel
read_excel()

Loading Data#

Long and Wide Form#

Data Loading#

This Page