============= Cleaning Data ============= Before data cleaning check what state the dataframe is in:: df.info() RangeIndex: 4 entries, 0 to 3 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Product 4 non-null object 1 Price 4 non-null int64 dtypes: int64(1), object(1) memory usage: 192.0+ bytes When cleaning data save to a new dataframe or overwrite the existing dataframe, otherwise the result is lost. If you want to keep a working copy in memory give the action a new name (df1, df2 ..), when you want to keep the result and overwrite the working dataframe save to the same name or add the parameter **inplace=True** (this last command is discouraged). Right Character Set ^^^^^^^^^^^^^^^^^^^^ Python 3 works in the UTF-8 character set, if you experience errors associated UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 .... or something similar check that the file is actually UTF-8. The file may still be in ASCII. When checking with **chardet** ensure that the program reads the contents of the `file `_ and not the name of the file:: with open('beer_list.csv') as f: print(f) <_io.TextIOWrapper name='beer_list.csv' mode='r' encoding='cp1252'> "beer_list.csv" is stored on a windows machine, hence encoding='cp1252':: import chardet filecsv_list = ['beer_list.csv'] for csv in filecsv_list: with open(csv,'rb') as f: data = f.read() # or a chunk, f.read(1000000) encoding=chardet.detect(data).get("encoding") print(encoding) b=pd.read_csv(csv,encoding=encoding,header=None,sep=',',engine='python') UTF-8-SIG If the answer had returned 'iso-8859-1' (equivalent to latin-1) then it needs converting. External sites may require **encoding='iso-8859-1'**, internal files can be converted by most file editors. Missing Values ^^^^^^^^^^^^^^ First check whether there are any missing values:: df.isnull() Product Price 0 False False 1 False False 2 False False 3 False False This may be all that is required in a small dataframe, for a larger dataframe use a summary query:: df.isnull().sum() Product 0 Price 0 On some dataframes there may be columns that are of little interest and can be safely removed:: df.dropna(subset=['column_name'], inplace=True) If there is a reason to hold back then:: df = df[df['column_name'].notna()] Wrong Data Type ^^^^^^^^^^^^^^^^ See whether the data types correspond to the expected. Uncleaned data has a lot of unassigned data type. The general catchall is **object** for anything not a number, which is assigned int64 or float64:: df.dtypes Unnamed: 0 int64 Product object Price int64 dtype: object Make a small dataframe with an error:: data = {'a': [0, 1, 2, 'a'], 'c': ['a', 'b', 'c', 'd']} dferr = pd.DataFrame(data) dferr a c 0 0 a 1 1 b 2 2 c 3 a d dferr.dtypes a object c object dtype: object Both columns are object as they either have mixed dtypes or are characters. Clean up column "a", if on a large dataframe it may not be so easy to spot:: dferr['a'] = pd.to_numeric(dferr['a'], errors='coerce') dferr a c 0 0.0 a 1 1.0 b 2 2.0 c 3 NaN d dferr.dtypes a float64 c object dtype: object Column "a" now shows as float64, as the replacement **NaN** is a float type. If the command raises a warning:: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead df.loc[pd.to_numeric(df['column_name'], errors='coerce').isna(), 'column_name'] Our original data had integers, a different replacement command is necessary .. sidebar:: Pandas NaN and Numpy nan Both refer to ``Not a Number``. Within Pandas it is used on a column that is meant to contain numerical data. Both *NaN* and numpy *nan* are floating point representations of a field that is not numeric. Column *a* should have an integer type, but the first correction attempt inserted *NaN* and the remaining fields became floating point numbers. Next Numpy's nan is used to search for the error and *-1* is used as a replacement , this time column *a* contains only integers and the type is corrected. :: dferr['a'] = dferr['a'].replace(np.nan, -1).astype(np.int64) dferr a c 0 0 a 1 1 b 2 2 c 3 -1 d dferr.dtypes a int64 c object dtype: object The numbers are now integers and the cleaned up data is **-1** which should be different to the other valid entries. Date Time Columns ^^^^^^^^^^^^^^^^^^ Whittle down those columns with object dtypes, using a similar cleanup command to to_numeric above, clean up date columns:: pd.to_datetime(df['date_col'], errors='coerce').isnull().value_counts() This time we are counting the number of errors and successes that exist. Knowing the dataframe size the result will show **False df_size** when everything is correct and the number of errors **True err_size**. Also a new column will be generated whilst searching for errors, this allows the old data to remain in the original column while the new column will display the errors as NaT (not a time). .. sidebar:: Detecting False Dates The query shows whether an error occurred, therefore an error occurred or not it displays the opposite of a correct answer. Hence *False* should have a large number, if clean the same as the dataframe size, whereas *True* should be small the number of errors found. Date time errors require a bit of unravelling. There can be errors caused by the wrong format, pandas reads the year first, then month then day in default mode. Since it is programmed in America it interprets 03-04-2021 as the 4th March and not the 3rd of April:: pd.to_datetime('2014-04-03') Timestamp('2014-04-03 00:00:00') pd.to_datetime('03-04-2014') Timestamp('2014-03-04 00:00:00') pd.to_datetime('03-04-2014', format='%d-%m-%Y') Timestamp('2014-04-03 00:00:00') Pandas was shown the intention by **format**, unfortunately date and time formats can vary from place to place, so the following is a summary of normally used values * %Y complete year (4 integers) * %m month, zero padded (2 integers) * %d day, zero padded (2 integers) * %H hour, 24 hour zero padded (2 integers) * %M minute, zero padded (2 integers) * %S second, zero padded (2 integers) The dates are separated by hyphens the times by full colons with a space between the date and time. Other formats can be used, see the `datetime table `_ for a full list of the many other possibilities. Start off with a dictionary containing some errors to correct:: # create a dataframe df = pd.DataFrame({ 'Date': ['2021-02-29','2021-04-02','03-04-2021','2021-04-04','2021-04-05'], 'Units Sold': [120, 123, 150, 160, 140] }) df Date Units Sold 0 2021-02-29 120 1 2021-04-02 123 2 03-04-2021 150 3 2021-04-04 160 4 2021-04-05 140 df.info() RangeIndex: 5 entries, 0 to 4 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 5 non-null object 1 Units Sold 5 non-null int64 dtypes: int64(1), object(1) memory usage: 208.0+ bytes # change data type to datetime df['Date'] = pd.to_datetime(df['Date']) Creates a pile of error statements, finishing off with:: dateutil.parser._parser.ParserError: day is out of range for month: 2021-02-29 present at position 0 This detected the false date in the first line:: pd.to_datetime(df['Date'], errors='coerce').isnull().value_counts() False 4 True 1 Name: Date, dtype: int64 Confirms the finding, 4 not null values, 1 false line. With a small dataframe one can view and rectify easily. On a larger dataframe create a separate column with the corrected datetimes:: df['correct'] = pd.to_datetime(df['Date'],errors='coerce') df Date Units Sold correct 0 2021-02-29 120 NaT 1 2021-04-02 123 2021-04-02 2 03-04-2021 150 2021-03-04 3 2021-04-04 160 2021-04-04 4 2021-04-05 140 2021-04-05 Although the column **correct** has the right value **NaT** for the first line the change in date on the third line might have slipped in unnoticed. Add **format** to the date, then any differences will be hightlighted and can be tackled accordingly:: df['correct'] = pd.to_datetime(df['Date'], format='%Y-%m-%d',errors='coerce') df Date Units Sold correct 0 2021-02-29 120 NaT 1 2021-04-02 123 2021-04-02 2 03-04-2021 150 NaT 3 2021-04-04 160 2021-04-04 4 2021-04-05 140 2021-04-05 In a larger dataframe it is impractical to find the errors by inspection, use a query to find the errors:: df[df['correct'].isna()] Date Units Sold correct 0 2021-02-29 120 NaT 2 03-04-2021 150 NaT Now the index numbers are known it is straightforward to change the **Date** to correct the errors:: df.at[0, 'Date'] = '2021-02-28' df.at[2, 'Date'] = '2021-04-03' df Date Units Sold correct 0 2021-02-28 120 NaT 1 2021-04-02 123 2021-04-02 2 2021-04-03 150 NaT 3 2021-04-04 160 2021-04-04 4 2021-04-05 140 2021-04-05 df['correct'] = pd.to_datetime(df['Date'], format='%Y-%m-%d',errors='coerce') df Date Units Sold correct 0 2021-02-28 120 2021-02-28 1 2021-04-02 123 2021-04-02 2 2021-04-03 150 2021-04-03 3 2021-04-04 160 2021-04-04 4 2021-04-05 140 2021-04-05 The dataframe is now ready to accept the dtype conversion of the Date column:: df['Date'] = pd.to_datetime(df['Date']) df.info() RangeIndex: 5 entries, 0 to 4 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 5 non-null datetime64[ns] 1 Units Sold 5 non-null int64 2 correct 5 non-null datetime64[ns] dtypes: datetime64[ns](2), int64(1) memory usage: 248.0 bytes The dtype for datetime is **datetime64(ns)**, where ns refers to nanoseconds. Now the dataframe can be cleaned up by removing the **correct** column and saving. Currency Symbols ^^^^^^^^^^^^^^^^ Pandas does not handle currency signs, so where necessary replace the local currency symbol with an empty value:: data = {'Cost': ['€100', '€150', '€200', '€250'], 'Item': ['a', 'b', 'c', 'd']} dferr = pd.DataFrame(data) dferr Cost Item 0 €100 a 1 €150 b 2 €200 c 3 €250 d dferr.dtypes Cost object Item object dtype: object dferr['Cost'] = dferr['Cost'].str.replace('€','') dferr Cost Item 0 100 a 1 150 b 2 200 c 3 250 d dferr.dtypes Cost object Item object dtype: object The **Cost** column is still an object, convert to integer:: dferr['Cost'] = dferr['Cost'].astype(int) dferr.dtypes Cost int32 Item object dtype: object .. note:: astype() Only specify int, not int32, float is similar only use float not float64. Thousand Delimiters ^^^^^^^^^^^^^^^^^^^^ Columns that show integers might have various delimiters to show thousands. It is easiest to remove delimiters and only show when displaying. Integers are hard coded not to allow delimiters. Under normal circumstances loading data directly with a dictionary does not allow delimiters. Using a similar method to datetime, check whether the column has any disallowed numeric values. This time 'Price' shows as an object:: df = pd.read_csv("file4.csv", sep=";") df Product Price 0 Tablet 250.00 1 Printer 105.00 2 Laptop 1,200.00 3 UHD Screen 400.00 df.dtypes Product object Price object print(pd.to_numeric(df['Price'], errors='raise')) ... ValueError: Unable to parse string "1,200.00" ... ValueError: Unable to parse string "1,200.00" at position 2 An error is produced together with its position - we already know it's in the column 'Price':: df['Price'] = df['Price'].str.replace(',','') df['Price'] = df['Price'].astype(float) df.dtypes Product object Price float64 Categories ^^^^^^^^^^ Check for columns with a few known values - maybe these columns are not so relevant and can be deleted. With a large dataframe find these columns, and check that all values are in place, like the following small example. First load the dataframe, show it and then call its info:: dfb = pd.read_csv("beer_list.csv", sep=";") dfb Beer Type Ferment Type OE°P ... FP°C TMD°C Gravity 0 Topvar 10 bottom 10.0 ... -2.11 2.94 P 1 Topvar 12 bottom 12.0 ... -2.49 2.68 P 2 Pilsner Urquelle 10 bottom 10.0 ... -2.04 2.69 P 3 Pilsner Urquelle 12 bottom 12.0 ... -2.36 2.18 P 4 Zlaty Bazant 10 bottom 10.0 ... -2.04 2.69 P 5 Heavy Czech bottom 13.6 ... -2.40 1.03 P 6 Bitter 1 top 1030.9 ... -1.51 2.60 B 7 Bitter 2 top 1045.3 ... -2.19 2.20 B 8 Mild 1 top 1030.7 ... -1.35 2.02 B 9 Mild 2 top 1036.5 ... -1.77 2.42 NaN 10 Zlaty Bazant 12 bottom 12.0 ... -2.36 2.18 NaN [11 rows x 13 columns] dfb.info() RangeIndex: 11 entries, 0 to 10 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Beer Type 11 non-null object 1 Ferment Type 11 non-null object 2 OE°P 11 non-null float64 3 AbV% 11 non-null float64 4 AbW% 11 non-null float64 5 DoF% 11 non-null float64 6 TE°P 11 non-null float64 7 AE°P 11 non-null float64 8 Wort Density kg/m³ 11 non-null float64 9 Beer Density kg/m³ 11 non-null float64 10 FP°C 11 non-null float64 11 TMD°C 11 non-null float64 12 Gravity 9 non-null object dtypes: float64(10), object(3) memory usage: 1.2+ KB dfb['Ferment Type'].nunique() 2 dfb['Ferment Type'].value_counts() bottom 7 top 4 Name: Ferment Type, dtype: int64 In the column **Ferment Type** the count of unique entries is 2 (function ``nunique()``), value_counts totals 11 which equals the number of rows, so this row requires no adjustment. Try the same with the column **Gravity**:: dfb['Gravity'].nunique() 2 dfb['Gravity'].value_counts() P 6 B 3 Name: Gravity, dtype: int64 dfb['Gravity'].isnull().values.any() True dfb['Gravity'].isnull().sum() 2 dfb['Gravity'].isnull() 0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 True 10 True Name: Gravity, dtype: bool Once again in **Gravity** column the count of unique entries is 2, but the value_counts are 6 and 3, only 9 rows filled out of 11. The **Beer Type** is not alphabetical so use the index to find the row. The previous 3 commands confirm what we already know, the last, **dfb['Gravity'].isnull()**, is closer to what we require, but is impractical for a larger dataframe:: dfb[dfb['Gravity'].isna()] Beer Type Ferment Type OE°P ... FP°C TMD°C Gravity 9 Mild 2 top 1036.5 ... -1.77 2.42 NaN 10 Zlaty Bazant 12 bottom 12.0 ... -2.36 2.18 NaN [2 rows x 13 columns] That's better! We now have just the rows where there are empty inputs in the target column. In case you were wondering there are 13 columns, hidden columns are not as relevant as the index, name and the target column. When constructing a dataframe it is important to have the columns with names first, close to the index. Looking at row 9, 'Gravity' we have a value of "NaN":: dfb.at[9, 'Gravity'] NaN as does the next row. Since row 9 is 'Beer Type' Mild 2 Gravity would be in British units, and row 10 is Zlaty Bazant 12 the Gravity would be in P. Update the values. dfb.at[9, 'Gravity'] = 'B' dfb.at[10, 'Gravity'] = 'P' .. sidebar:: Alternative Index If Beer Type had unique values then it could have been made into an index in which case our modified commands would have been:: dfb.at['Mild 2', 'Gravity'] = 'B' dfb.at['Zlaty Bazant 12', 'Gravity'] = 'P' When working with single cells use **at** or **iat**, these are similar to **loc** and **iloc** when working with rows or columns. There is still work to be done on the columns **Ferment Type** and **Gravity** even after all empty inputs are rectified. These columns have limited values and we can let pandas autodetect all these values in declared columns:: dfb["Ferment Type"] = pd.Categorical(dfb["Ferment Type"]) dfb["Gravity"] = pd.Categorical(dfb["Gravity"]) Check on what changes if any have occurred:: dfb.info() RangeIndex: 11 entries, 0 to 10 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Beer Type 11 non-null object 1 Ferment Type 11 non-null category 2 OE°P 11 non-null float64 3 AbV% 11 non-null float64 4 AbW% 11 non-null float64 5 DoF% 11 non-null float64 6 TE°P 11 non-null float64 7 AE°P 11 non-null float64 8 Wort Density kg/m³ 11 non-null float64 9 Beer Density kg/m³ 11 non-null float64 10 FP°C 11 non-null float64 11 TMD°C 11 non-null float64 12 Gravity 11 non-null category dtypes: category(2), float64(10), object(1) memory usage: 1.3+ KB The non-null count in Gravity has been eliminated and the column type has changed from object to category. On a larger dataframe memory savings would also be seen:: dfb['Beer Type'].dtype dtype('O') dfb['Ferment Type'].dtype CategoricalDtype(categories=['bottom', 'top'], ordered=False) dfb['Gravity'].dtype CategoricalDtype(categories=['B', 'P'], ordered=False) Smaller Types ^^^^^^^^^^^^^^ Often the numerical data type is stored in a much larger format than necessary. In the dataframe dfb there are 10 numeric columns stored as float64, in this instance the storage type can be changed without any loss:: float64_cols = list(dfb.select_dtypes(include='float64')) dfb[float64_cols] = dfb[float64_cols].astype('float32') alternatively:: import numpy as np dfb[dfb.select_dtypes(np.float64).columns] = dfb.select_dtypes(np.float64).astype(np.float32) dfb.info() RangeIndex: 11 entries, 0 to 10 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Beer Type 11 non-null object 1 Ferment Type 11 non-null category 2 OE°P 11 non-null float32 3 AbV% 11 non-null float32 4 AbW% 11 non-null float32 5 DoF% 11 non-null float32 6 TE°P 11 non-null float32 7 AE°P 11 non-null float32 8 Wort Density kg/m³ 11 non-null float32 9 Beer Density kg/m³ 11 non-null float32 10 FP°C 11 non-null float32 11 TMD°C 11 non-null float32 12 Gravity 11 non-null category dtypes: category(2), float32(10), object(1) memory usage: 926.0+ bytes The memory use has gone down. A similar method can be used to change int64 to int32. Invalid Data ^^^^^^^^^^^^ Continuing with data cleanup, invalid data is even more of a headache than non-existing data. If the data does not exist it can be readily detected and updated, replaced or dropped. First where we have columns with the wrong type has been inserted should be straightforward in the Categorical types. Help with numeric data comes with statistical information, first set the display output to show all the rows:: pd.set_option('display.max_columns', None) dfb.describe() OE°P AbV% AbW% DoF% TE°P AE°P \ count 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 mean 383.909091 4.284545 3.419091 64.570909 371.301818 368.437273 std 516.889786 0.901902 0.723194 5.812398 509.667698 508.073459 min 10.000000 2.500000 1.990000 52.120000 2.960000 1.340000 25% 11.000000 3.950000 3.150000 62.435000 3.455000 1.730000 50% 12.000000 4.500000 3.590000 65.870000 4.270000 2.480000 75% 1030.800000 4.965000 3.965000 68.360000 1012.800000 1008.100000 max 1045.300000 5.400000 4.320000 71.490000 1016.200000 1011.300000 Wort Density kg/m³ Beer Density kg/m³ FP°C TMD°C count 11.000000 11.000000 11.000000 11.000000 mean 946.946364 917.256364 -2.056364 2.330000 std 313.806788 303.900661 0.372995 0.517958 min 1.050000 1.010000 -2.490000 1.030000 25% 1033.700000 1006.665000 -2.360000 2.180000 50% 1040.030000 1007.640000 -2.110000 2.420000 75% 1046.835000 1009.670000 -1.905000 2.685000 max 1055.130000 1016.730000 -1.350000 2.940000 Obviously British and Continental units cannot be mixed, so the beer types should have been separated, as different dataframes or additional rows used, whichever is appropriate. .. sidebar:: Change the Display Limits The option for more columns has been already used:: pd.set_option('display.max_rows', None) pd.set_option('display.expand_frame_repr', False) pd.set_option('max_colwidth', -1) Splitting the Dataframe ^^^^^^^^^^^^^^^^^^^^^^^^ Update **Zlaty Bazant 12** to continental units. Remove the British beers, then the continental ones, save to separate pickle files:: dfc = dfb.drop(dfb[dfb.Gravity == 'B'].index) dfc Beer Type Ferment Type OE°P AbV% AbW% DoF% TE°P AE°P \ 0 Topvar 10 bottom 10.0 4.50 3.59 71.489998 2.96 1.34 1 Topvar 12 bottom 12.0 5.40 4.32 71.080002 3.63 1.69 2 Pilsner Urquelle 10 bottom 10.0 4.30 3.43 68.360001 3.28 1.73 3 Pilsner Urquelle 12 bottom 12.0 5.00 3.99 65.870003 4.27 2.48 4 Zlaty Bazant 10 bottom 10.0 4.30 3.43 68.360001 3.28 1.73 5 Heavy Czech bottom 13.6 4.93 3.94 57.450001 6.03 4.26 10 Zlaty Bazant 12 bottom 12.0 5.00 3.99 65.870003 4.27 2.48 Wort Density kg/m³ Beer Density kg/m³ FP°C TMD°C Gravity 0 1040.030029 1005.219971 -2.11 2.94 P 1 1048.369995 1006.590027 -2.49 2.68 P 2 1040.030029 1006.739990 -2.04 2.69 P 3 1048.369995 1009.690002 -2.36 2.18 P 4 1040.030029 1006.739990 -2.04 2.69 P 5 1055.130005 1016.729980 -2.40 1.03 P 10 1048.369995 1009.690002 -2.36 2.18 P dfc.to_pickle('beer_list_cont') dfb.drop(dfb[dfb.Gravity == 'P'].index, inplace=True) dfb Beer Type Ferment Type OE°P AbV% AbW% DoF% TE°P \ 6 Bitter 1 top 1030.900024 3.0 2.39 61.889999 1011.900024 7 Bitter 2 top 1045.300049 4.6 3.67 64.809998 1016.200012 8 Mild 1 top 1030.699951 2.5 1.99 52.119999 1014.799988 9 Mild 2 top 1036.500000 3.6 2.87 62.980000 1013.700012 AE°P Wort Density kg/m³ Beer Density kg/m³ FP°C TMD°C Gravity 6 1007.599976 1030.900024 1007.640015 -1.51 2.60 B 7 1009.599976 1045.300049 1009.650024 -2.19 2.20 B 8 1011.299988 1030.699951 1011.260010 -1.35 2.02 B 9 1008.599976 1036.500000 1008.549988 -1.77 2.42 B dfb.to_pickle('beer_list_brit') When retrieving the data from the pickle files note that the original indexing is in place, as are the float32 and categorized columns. If saved in csv then remember to switch off the index when saving, as a new index will be made when reloaded. Remember that columns will lose their categorisation and numbers will be the highest type.