Thursday, March 12, 2020

multi-index (column) report using pandas

multi-index-09-03-2020 <!-- Custom stylesheet, it must be in the same directory as the html file
In [1]:
import pandas as pd
import numpy as np


#1 input file
input_file = r"D:\Python Scripts\multi_index_3.xlsx"

df1=pd.read_excel(input_file)
df1['Focus_area']=df1['Focus_area'].astype('category')
table0=df1
table0
#table0 is shown below.
Out[1]:
Focus_area City As_on Spend
0 Education Mumbai 2019-01-01 1000000
1 Healthcare Pune 2019-03-31 20000
2 Environment Chennai 2018-03-31 100000
3 Disaster Relief Bengaluru 2017-12-01 220000
4 Vocational Skills Varanasi 2018-09-30 540000
5 Education Pune 2019-06-30 3000000
6 Environment Pune 2019-03-31 666000
In [67]:
df1['As_on']=pd.to_datetime(df1.As_on)
In [68]:
df1.dtypes
Out[68]:
Focus_area          category
City                  object
As_on         datetime64[ns]
Spend                  int64
dtype: object
In [69]:
df1 = df1.set_index(['Focus_area','City'])
In [70]:
df1.dtypes
Out[70]:
As_on    datetime64[ns]
Spend             int64
dtype: object
In [4]:
df1.groupby(df1['As_on'])['Spend'].sum()
Out[4]:
As_on
2017-12-01     220000
2018-03-31     100000
2018-09-30     540000
2019-01-01    1000000
2019-03-31     686000
2019-06-30    3000000
Name: Spend, dtype: int64
In [5]:
df1['Quarter'] = pd.PeriodIndex(df1['As_on'], freq='Q-MAR').strftime('Q%q')#Adding quarter to the dataframe.

Q-DEC specifies quarterly periods whose last quarter ends on the last day in December.

Q-MAR specifies quarterly periods whose last quarter ends on the last day in March.

In [ ]:
df1
In [7]:
df1['Year']=df1.As_on.dt.year#Adding Year to the dataframe.
df1
Out[7]:
Focus_area City As_on Spend Quarter Year
0 Education Mumbai 2019-01-01 1000000 Q4 2019
1 Healthcare Pune 2019-03-31 20000 Q4 2019
2 Environment Chennai 2018-03-31 100000 Q4 2018
3 Disaster Relief Bengaluru 2017-12-01 220000 Q3 2017
4 Vocational Skills Varanasi 2018-09-30 540000 Q2 2018
5 Education Pune 2019-06-30 3000000 Q1 2019
6 Environment Pune 2019-03-31 666000 Q4 2019
In [12]:
df2 = pd.pivot_table(df1,index=['City','Focus_area'],columns=['Year','Quarter'],values=['Spend'],fill_value=0)#works
#df2.query('City==["Pune","Mumbai"]')#works
In [13]:
df2
Out[13]:
Spend
Year 2017 2018 2019
Quarter Q3 Q2 Q4 Q1 Q4
City Focus_area
Bengaluru Disaster Relief 220000 0 0 0 0
Chennai Environment 0 0 100000 0 0
Mumbai Education 0 0 0 0 1000000
Pune Education 0 0 0 3000000 0
Environment 0 0 0 0 666000
Healthcare 0 0 0 0 20000
Varanasi Vocational Skills 0 540000 0 0 0
In [14]:
df2 = pd.pivot_table(df1,index=['Year','Quarter'],columns=['City','Focus_area'],values=['Spend'],fill_value=0)#works
df2
Out[14]:
Spend
City Bengaluru Chennai Mumbai Pune Varanasi
Focus_area Disaster Relief Environment Education Education Environment Healthcare Vocational Skills
Year Quarter
2017 Q3 220000 0 0 0 0 0 0
2018 Q2 0 0 0 0 0 0 540000
Q4 0 100000 0 0 0 0 0
2019 Q1 0 0 0 3000000 0 0 0
Q4 0 0 1000000 0 666000 20000 0
In [ ]:
df2.stack(level=0)#works
#level=0 is the outermost level i.e. spend
#level=1 is the next level i.e. city
#level=2 is the innermost level i.e. Focus_area
In [22]:
df2.unstack(level=1)
#level=0 is the outermost level i.e. Year
#level=1 is the next level i.e. quarter
Out[22]:
Spend
City Bengaluru Chennai Mumbai ... Pune Varanasi
Focus_area Disaster Relief Environment Education ... Environment Healthcare Vocational Skills
Quarter Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 ... Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
Year
2017 NaN NaN 220000.0 NaN NaN NaN 0.0 NaN NaN NaN ... 0.0 NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN
2018 NaN 0.0 NaN 0.0 NaN 0.0 NaN 100000.0 NaN 0.0 ... NaN 0.0 NaN 0.0 NaN 0.0 NaN 540000.0 NaN 0.0
2019 0.0 NaN NaN 0.0 0.0 NaN NaN 0.0 0.0 NaN ... NaN 666000.0 0.0 NaN NaN 20000.0 0.0 NaN NaN 0.0
3 rows × 28 columns
In [ ]:
 
-->

Tuesday, March 10, 2020

CSR Multi-index report in python using pandas

I wanted to make a report with multi-index columns and rows using python and pandas. 

The starting point was data in Excel file. The data was brought in pandas data frame.

I wanted an output report that looked like the report shown below.

But there was one condition. The condition was the quarters of the year has to be as per Indian financial year that starts on 1st April and end on 31st March the following year. 

What helped me was the below syntax.

Q-DEC specifies quarterly periods whose last quarter ends on the last day in December. 

Q-MAR specifies quarterly periods whose last quarter ends on the last day in March.

So, I chose Q-MAR. 

The data shown is about spend on Social and Environmental efforts of a company in India. A similar report has to be published by the company, as applicable, as per Indian laws. 

The report format can also be extended to report on financial numbers. The data itself is imaginary. 

Monday, March 2, 2020

CSR Reporting using Power BI - multi-index or nested columns and rows report

I started learning Power BI a few days ago. I downloaded Power BI desktop.

What I realized is I can bring together data from different Excel files or Tables using Data Model, analyze it and report it. The totals are done by Power BI.

The time I had spent on Data Models in MS Access came in handy.   

I used Power Query to bring the files together from folder.

Below is my first multi-index or nested columns and rows report. Power BI makes it easy to report in this way. In Power BI this is a matrix visual and is interactive.

The data is not real. Defining Calendar Table helped. The Calendar represents Indian financial year and quarters.

The next step for me is to expand it to add multiple years, other locations and other focus areas.

Once I have these building blocks in place, then the next challenge is to develop a dashboard.

Wednesday, January 29, 2020

Excel Power Query - One to many and other way round


I had a list of unicorns downloaded from internet as shown below. The Cat1 is categories column. 

Figure 1

So, for every unicorn (Company), you have one or many select investors. 

Step 1:

Using power query in Excel 2016, I filtered the list to show Indian unicorns (16 of them). This was a fairly simple operation. 

Step 2:

Then again using power query I transformed the column containing select investors (many in one cell) in to multiple rows. I got a list of every Indian unicorn against the every select investor. The total number of rows grew to become 46.

Step 3:

Further I grouped all unicorn companies on select investors to know all the Indian unicorns they have invested. But for that I had to write a small custom function to combine all the unicorns as one list against every investor from select investor column. Then the list was put in column named Custom. Now we know each investor and Indian unicorns it is investing in as shown below. 

Figure 2

So, to conclude from unicorn company with its multiple investing firms; we now have investing firm and multiple unicorns companies it is investing in.  

This kind of requirement comes often with different contexts. I am new to power query and finding its usefulness as I could do the above in minutes. Next I want to do similar output using python. 


Monday, January 20, 2020

ET top 25 - further analysis

Further to below blog, I have added a few more plots. 

A) I have added a column to dataframe to indicate if the company belongs to public (pub) sector or private (pvt) sector. The below plot (similar to the earlier blogpost) shows the plot by types of companies. 

B) I also wanted to study the distribution, so I have added a histogram. From it, we can see that most of companies are below revenue level of Rs 2,00,000 Cr (USD 28.5billion).
C) I added one more calculated column in the dataframe to show profit percentage compared to revenue in the below graph. High profits percentage companies are in private sector and are IT companies namely TCS (only company with 20% plus profit ratio) and Infosys. The other two companies are banking firms namely HDFC Bank and HDFC. The top five companies by revenue are not that profitable by the ratio of profits to revenue and most of them are public sector companies. 
The data was analyzed using python, pandas and plotted using seaborn library.