I analyzed an Excel file containing a list of 300+ #unicorns using #Python and #Pandas. I also made some nice charts.
Later I realized that the column containing the classification of each unicorn, such as TravelTech, EdTech, or Ecommerce, had not been written consistently.
The same classification was spelled in several different ways.
Ecommerce was written as eCommerce, ecommerce, e-commerce and so on. With these values my analysis wasn’t right: grouping on the classification column split each category across its spelling variants, so the counts were wrong. These kinds of errors are common when no data validation is in place.
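To see why the grouping goes wrong, here is a minimal sketch using only the standard library. The `sectors` list is a made-up sample, not the real column, and `Counter` stands in for a group-and-count step; grouping in Pandas also matches strings exactly, so the effect is the same.

```python
from collections import Counter

# Hypothetical sample of the classification column (not the real data).
sectors = ["eCommerce", "ecommerce", "e-commerce", "EdTech", "Edtech"]

# Counting raw labels treats every spelling as its own category.
raw_counts = Counter(sectors)

print(len(raw_counts))  # 5 separate groups, although only 2 sectors are present
```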
So I started all over again. To illustrate the problem in this post, I have collected the values into a list.
The existing values are given below.
['Auto Tech', 'AutoTech', 'Digital health', 'Digital Health', 'EdTech', 'Edtech', 'Ed Tech', 'e-commerce', 'eCommerce', 'ecommerce', 'Food & Beverage', 'Food & Beverages', 'Food and Beverage', 'Health & Wellnes', 'Health & Wellness', 'IoT', 'Internet of Things', 'Sales Tech', 'SalesTech', 'On Demand', 'On-Demand', 'On-demand', 'Supply Chain & Logistics', 'Supply chain & Logistics', 'Travel Tech', 'TravelTech']
Using Python, I cleaned the list. I used #Spyder 4.0, which is beautiful. I used good old loops for the logic, since I am comfortable with loops.
The new list is given below.
['Autotech', 'Autotech', 'Digitalhealth', 'Digitalhealth', 'Edtech', 'Edtech', 'Edtech', 'Ecommerce', 'Ecommerce', 'Ecommerce', 'Food&Beverages', 'Food&Beverages', 'Food&Beverages', 'Health&Wellness', 'Health&Wellness', 'Iot', 'Internetofthings', 'Salestech', 'Salestech', 'Ondemand', 'Ondemand', 'Ondemand', 'Supplychain&Logistics', 'Supplychain&Logistics', 'Traveltech', 'Traveltech']
The cleaned list is now ready for analysis. The spelling variants of each classification value are now written consistently.
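One way a loop-based cleanup like this could look is sketched below. It is an illustration, not the exact script used for this post: the names `clean_label`, `clean_labels` and `SPELLING_FIXES` are hypothetical, and the rules (treat "and" as "&", drop spaces and hyphens, capitalize each part, then patch known misspellings) are assumptions chosen because they reproduce the cleaned list above.

```python
# Hypothetical corrections for known misspellings, keyed by the normalized form.
SPELLING_FIXES = {
    "Food&Beverage": "Food&Beverages",
    "Health&Wellnes": "Health&Wellness",
}


def clean_label(label):
    """Normalize one classification label to a single spelling."""
    # Treat "and" and "&" as the same separator.
    label = label.replace(" and ", " & ")
    parts = []
    for part in label.split("&"):
        # Drop spaces and hyphens, then keep only the first letter uppercase.
        squeezed = part.replace(" ", "").replace("-", "")
        parts.append(squeezed.capitalize())
    cleaned = "&".join(parts)
    # Apply a manual fix if this normalized form is a known misspelling.
    return SPELLING_FIXES.get(cleaned, cleaned)


def clean_labels(labels):
    """Clean every label in a list with a plain loop."""
    cleaned = []
    for label in labels:
        cleaned.append(clean_label(label))
    return cleaned


sample = ["Ed Tech", "e-commerce", "Food and Beverage", "Supply chain & Logistics"]
print(clean_labels(sample))
# ['Edtech', 'Ecommerce', 'Food&Beverages', 'Supplychain&Logistics']
```

The manual fix table is the weak spot of this approach: every new misspelling needs its own entry, which is one more reason to validate the data at entry time.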
However, there is one more iteration I have to do: IoT and ‘Internet of Things’ still appear as separate categories.
I hope to take care of that as well shortly.
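A possible way to handle that remaining step is an alias table applied after the cleanup. This is only a sketch under assumptions: `ALIASES` and `merge_aliases` are hypothetical names, and picking `Iot` as the preferred label is an arbitrary choice.

```python
# Hypothetical alias table: cleaned label -> preferred label.
ALIASES = {"Internetofthings": "Iot"}


def merge_aliases(labels, aliases=ALIASES):
    """Replace each label that has an alias entry with its preferred form."""
    merged = []
    for label in labels:
        merged.append(aliases.get(label, label))
    return merged


cleaned = ["Iot", "Internetofthings", "Edtech"]
print(merge_aliases(cleaned))
# ['Iot', 'Iot', 'Edtech']
```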