Tuesday, October 26, 2021

Organizing data in Spreadsheets : Part-I

 Data Organisation in Spreadsheets (Part-I)

1. About Spreadsheets in general 

  • Congratulations on joining this journey of making better spreadsheets.
  • Spreadsheet is easy to use and that creates its own challenges.
  • Is the Spreadsheet formatted for Human Eyes?
  • Is the Spreadsheet formatted for a Computer? 
  • Is the Spreadsheet formatted for Both?

2. The answer is - Be Consistent…

-The first rule of data organization is be consistent. 

  • Whatever you do, do it consistently. Entering and organizing your data in a consistent way from the start will prevent you and your collaborators from having to spend time harmonizing the data later.
  • Use consistent codes for categorical variables. For a categorical variable like the mode of transport in a study of daily office commuters, use a single common value for private transport (e.g., “private”), and a single common value for public transport (e.g., “public”).

-Use a consistent fixed code for any missing values. 

  • Please have every cell filled in, so that one can distinguish between truly missing values and unintentionally missing values.
  • You could also use a hyphen.

-Use consistent variable names. 

  • If in one file (e.g., the first batch of subjects), you have a variable called “Public_Travel,” then call it exactly that in other files.

-Use a consistent data layout in multiple files. 

  • If your data are in multiple files and you use different layouts in different files, it will be extra work for the analyst to combine the files into one dataset for analysis.
  • With a consistent structure, it will be easy to automate this process.

-Use consistent file names. 

  • Have some system for naming files. If one file is called “Travel_batch1_2021-01-31.csv,” then do not call the file for the next batch “Travel2.csv” but rather use “Travel_batch2_2021-02-28.csv.
  • Keeping a consistent file naming scheme will help ensure that your files remain well organized, and it will make it easier to batch process the files if you need to.

-Use a consistent format for all dates, 

  • A format could be YYYY-MM-DD, for example, 2015-08-01. Or follow your local way of formatting dates.
  • If sometimes you write 8/1/2015 and sometimes 8-1-15, it will be more difficult to use the dates in analyses or data visualizations.

-Use consistent phrases in your notes. 

  • If you have a separate column of notes (e.g., “Personal Car” or “Bus”), be consistent in what you write. Do not sometimes write “Personal Car” and sometimes “Personal car,” or sometimes “Local Train” and sometimes “Local” or “Train”.

-Be careful about extra spaces within cells. 

  • A blank cell is different than a cell that contains a single space. And “Train” is different from “ Train ” (i.e., with spaces at the beginning and end). Similarly " Train" or "Train ". It has a space at the beginning and end respectively.

3. Further reading:

The links for the other two parts of this post.


Part-II


Part II post.


Part-III


Part III post.

No comments: