Monday, October 26, 2020

Economic Times Top 10 companies


The goal: scrape the web to get the Economic Times 2019 Top 10 Companies table.

  • Let us import the necessary libraries.
In [43]:
from bs4 import BeautifulSoup                  # HTML parsing
from urllib.request import urlopen, Request    # fetching the page
import lxml                                    # parser backend for BeautifulSoup
import pandas as pd                            # dataframes
from tabulate import tabulate                  # optional pretty-printing of tables
  • The link address is: https://economictimes.indiatimes.com/et500

Opening the URL with urlopen and
reading it with BeautifulSoup directly failed.
The error was
HTTP Error 403: Forbidden,
evidently because the site rejects requests without a browser User-Agent.

  • So I tried the following, adding a User-Agent header.
In [ ]:
site = "https://economictimes.indiatimes.com/et500"
hdr = {'User-Agent': 'Mozilla/5.0'}       # pose as a browser
req = Request(site, headers=hdr)
page = urlopen(req)                       # succeeds with the header set
soup = BeautifulSoup(page, 'lxml')        # give BeautifulSoup an explicit parser
  • Let us test:

soup.title

The output was:

<title>ET 500 Company List - List of Top Companies in India 2019 | Economic Times ET500</title>

  • It worked.

  • Find the table rows.

In [ ]:
rows = soup.find_all('tr')
print(rows[:3]) 

[<tr class="table_head"><th colspan="2">Rank</th><th rowspan="2">Company Name</th><th rowspan="2">Market Cap <br/> (Rs. Cr)</th><th rowspan="2">Revenue <br/> Change%</th><th rowspan="2">PAT <br/> Change%</th></tr>, <tr class="table_head"><th>2019</th><th>2018</th></tr>, <tr class="data light"><td class="Rnk1 textC">1</td><td class="Rnk2 textC">2</td><td class="comp"><a class="etcmpnylist" href="/stocks/companyid-13215.cms" target="_blank" title="Reliance Industries Ltd.">Reliance Industries Ltd.</a></td><td class="textR">949,280.41</td><td class="textR">43.36</td><td class="textR">9.74</td></tr>]

  • From the table rows, let us extract the table data (<td>) items.
In [ ]:
list_rows = []
for row in rows:
    row_td = row.find_all('td')                              # all <td> cells in the row
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells,'lxml').get_text()   # strip the tags, keep the text
    list_rows.append(cleantext)

list_rows

['[]',
 '[]',
 '[1, 2, Reliance Industries Ltd., 949,280.41, 43.36, 9.74]',
 '[2, 1, Indian Oil Corporation Ltd., 125,768.47, 24.99, -21.69]',
 '[3, 3, Oil And Natural Gas Corporation Ltd., 172,135.97, 29.97, 37.95]',
 '[4, 4, State Bank of India, 290,140.98, 7.88, LP]',
 '[5, 5, Tata Motors Ltd., 48,588.33, 2.06, PL]',
 '[6, 6, Bharat Petroleum Corporation Ltd., 111,550.97, 26.10, -13.39]',
 '[7, 7, Hindustan Petroleum Corporation Ltd., 45,237.26, 24.94, -7.31]',
 '[8, 8, Rajesh Exports Ltd., 20,200.00, -6.35, 2.08]',
 '[9, 9, Tata Steel Ltd., 45,758.52, 15.42, -23.94]',
 '[10, 10, Coal India Ltd., 126,223.50, 14.04, 148.09]']
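As an aside, the same cell text can be extracted without the str() round-trip by reading each <td> directly. A minimal sketch, assuming the rows list from above; keeping the cells separate this way would also avoid the comma problem described below:

rows_text = []
for row in rows:
    # header rows contain <th> but no <td>, so their cell list is empty
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        rows_text.append(cells)

rows_text[0]  # e.g. ['1', '2', 'Reliance Industries Ltd.', '949,280.41', '43.36', '9.74']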
  • Now let us prepare the dataframe.
In [ ]:
df = pd.DataFrame(list_rows)
In [ ]:
df
In [ ]:
df1 = df.loc[2:]
In [ ]:
df2 = df1[0].str.split(',',expand=True)
In [ ]:
df2[0] = df2[0].str.strip('[')   # drop the stray leading '['
df2[6] = df2[6].str.strip(']')   # drop the stray trailing ']'
print(df2.head(3))
In [ ]:
df2[7] = df2[3] + df2[4]     # re-join the two halves of the market-cap value
df3 = df2.drop([3,4],axis=1)
df3 = df3[[0,1,2,7,5,6]]     # restore the column order

The market-capitalisation column had a ',' inside its numbers (e.g. 949,280.41).

So when the dataframe df1 was split on ',', the market-capitalisation values were inadvertently split as well, into columns 3 and 4.

The two halves were therefore re-joined into column 7; columns 3 and 4 were dropped, and the column sequence was rearranged in df3.
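Since the re-joined market-cap strings still carry a leading space (' 949' + '280.41' gives ' 949280.41'), they could also be converted to proper floats. A minimal sketch, assuming df3 as built above:

# Assumes df3 from the cells above; column 7 holds the re-joined strings.
df3[7] = pd.to_numeric(df3[7].str.strip())  # ' 949280.41' -> 949280.41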

In [ ]:
df3.head(3)
In [ ]:
df3_md = df3.to_markdown()
In [42]:
print(df3_md)
|    |   0 |   1 | 2                                    |        7 |     5 | 6      |
|---:|----:|----:|:-------------------------------------|---------:|------:|:-------|
|  2 |   1 |   2 | Reliance Industries Ltd.             | 949280   | 43.36 | 9.74   |
|  3 |   2 |   1 | Indian Oil Corporation Ltd.          | 125768   | 24.99 | -21.69 |
|  4 |   3 |   3 | Oil And Natural Gas Corporation Ltd. | 172136   | 29.97 | 37.95  |
|  5 |   4 |   4 | State Bank of India                  | 290141   |  7.88 | LP     |
|  6 |   5 |   5 | Tata Motors Ltd.                     |  48588.3 |  2.06 | PL     |
|  7 |   6 |   6 | Bharat Petroleum Corporation Ltd.    | 111551   | 26.1  | -13.39 |
|  8 |   7 |   7 | Hindustan Petroleum Corporation Ltd. |  45237.3 | 24.94 | -7.31  |
|  9 |   8 |   8 | Rajesh Exports Ltd.                  |  20200   | -6.35 | 2.08   |
| 10 |   9 |   9 | Tata Steel Ltd.                      |  45758.5 | 15.42 | -23.94 |
| 11 |  10 |  10 | Coal India Ltd.                      | 126224   | 14.04 | 148.09 |

Columns to be Renamed

In [ ]:
df3.rename(columns = {0:'Rank2019',
           1:'Rank2018',
           2:' Company Name',
           7:' Market Cap  (Rs. Cr)',
           5:' Revenue  Change%',
           6:' PAT  Change%'},
           inplace = True)
In [40]:
df3.head(3)
Out[40]:
|    |   Rank2019 |   Rank2018 | Company Name                         |   Market Cap (Rs. Cr) |   Revenue Change% | PAT Change%   |
|---:|-----------:|-----------:|:-------------------------------------|----------------------:|------------------:|:--------------|
|  2 |          1 |          2 | Reliance Industries Ltd.             |             949280.41 |             43.36 | 9.74          |
|  3 |          2 |          1 | Indian Oil Corporation Ltd.          |             125768.47 |             24.99 | -21.69        |
|  4 |          3 |          3 | Oil And Natural Gas Corporation Ltd. |             172135.97 |             29.97 | 37.95         |

Now let us get the table header.

In [ ]:
col_labels = soup.find_all('th')   # the header (<th>) cells
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str,'lxml').get_text()   # strip the tags
all_header.append(cleantext2)
In [ ]:
dfh1 = pd.DataFrame(all_header)    
dfh2 = dfh1[0].str.split(',',expand=True)
In [ ]:
dfh2[0] = dfh2[0].str.strip('[')
dfh2[6] = dfh2[6].str.strip(']')
In [ ]:
dfh_a = dfh2.drop([5,6],axis=1)                  # drop the '2019'/'2018' sub-headers
dfh_a.columns = dfh_a.iloc[0]                    # use the header row as column names
dfh_a.insert(1,'Rank2018','Rank2018',True)       # add a Rank2018 column
dfh_a.rename(columns = {'Rank' : 'Rank2019'}, inplace = True)
In [ ]:
frames = [dfh_a,df3]
df4 = pd.concat(frames)
In [39]:
df5 = df4.drop(0)   # drop the duplicated header row
df5
Out[39]:
|    |   Rank2019 |   Rank2018 | Company Name                         |   Market Cap (Rs. Cr) |   Revenue Change% | PAT Change%   |
|---:|-----------:|-----------:|:-------------------------------------|----------------------:|------------------:|:--------------|
|  2 |          1 |          2 | Reliance Industries Ltd.             |             949280.41 |             43.36 | 9.74          |
|  3 |          2 |          1 | Indian Oil Corporation Ltd.          |             125768.47 |             24.99 | -21.69        |
|  4 |          3 |          3 | Oil And Natural Gas Corporation Ltd. |             172135.97 |             29.97 | 37.95         |
|  5 |          4 |          4 | State Bank of India                  |             290140.98 |              7.88 | LP            |
|  6 |          5 |          5 | Tata Motors Ltd.                     |              48588.33 |              2.06 | PL            |
|  7 |          6 |          6 | Bharat Petroleum Corporation Ltd.    |             111550.97 |             26.10 | -13.39        |
|  8 |          7 |          7 | Hindustan Petroleum Corporation Ltd. |              45237.26 |             24.94 | -7.31         |
|  9 |          8 |          8 | Rajesh Exports Ltd.                  |              20200.00 |             -6.35 | 2.08          |
| 10 |          9 |          9 | Tata Steel Ltd.                      |              45758.52 |             15.42 | -23.94        |
| 11 |         10 |         10 | Coal India Ltd.                      |             126223.50 |             14.04 | 148.09        |
In [ ]:
df5_md = df5.to_markdown()
In [41]:
print(df5_md)
|    |   Rank2019 |   Rank2018 |  Company Name                        |    Market Cap  (Rs. Cr) |    Revenue  Change% |  PAT  Change%   |
|---:|-----------:|-----------:|:-------------------------------------|------------------------:|--------------------:|:----------------|
|  2 |          1 |          2 | Reliance Industries Ltd.             |                949280   |               43.36 | 9.74            |
|  3 |          2 |          1 | Indian Oil Corporation Ltd.          |                125768   |               24.99 | -21.69          |
|  4 |          3 |          3 | Oil And Natural Gas Corporation Ltd. |                172136   |               29.97 | 37.95           |
|  5 |          4 |          4 | State Bank of India                  |                290141   |                7.88 | LP              |
|  6 |          5 |          5 | Tata Motors Ltd.                     |                 48588.3 |                2.06 | PL              |
|  7 |          6 |          6 | Bharat Petroleum Corporation Ltd.    |                111551   |               26.1  | -13.39          |
|  8 |          7 |          7 | Hindustan Petroleum Corporation Ltd. |                 45237.3 |               24.94 | -7.31           |
|  9 |          8 |          8 | Rajesh Exports Ltd.                  |                 20200   |               -6.35 | 2.08            |
| 10 |          9 |          9 | Tata Steel Ltd.                      |                 45758.5 |               15.42 | -23.94          |
| 11 |         10 |         10 | Coal India Ltd.                      |                126224   |               14.04 | 148.09          |

So we got the table we wanted.

It may be done in more efficient ways, but this is my first such attempt on my own.
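One of those more efficient ways might be pandas' own read_html, which returns one DataFrame per <table> element on a page. A sketch only, reusing the site and hdr defined above; I have not tested it against the live page:

import pandas as pd
from urllib.request import urlopen, Request

# A sketch, not a tested solution; reuses `site` and `hdr` from above.
req = Request(site, headers=hdr)
tables = pd.read_html(urlopen(req))  # one DataFrame per <table> on the page
et500 = tables[0]
print(et500.head())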

Sunday, October 18, 2020

Learning Python



I started my journey of learning Python with Eric Matthes. The name of the book is Python Crash Course. It gave me the necessary start. Once, when I was stuck, I asked him a query and he responded immediately.
Later I found the Python Data Science Handbook very useful. It is written by Jake VanderPlas.
To learn pandas, I bought a book by Daniel Y. Chen. The name of the book is Pandas for Everyone.
Recently I found the articles by Kimberley Fessel to be very useful. I learned about her excellent articles through a podcast. If I am not mistaken, I heard Daniel Y. Chen on the same podcast channel.
All of the above are very well known. But still, I hope this information helps.

Saturday, July 25, 2020

Where are IITs located?

With new Indian Institutes of Technology (IITs) being set up, I was keen to know their locations.

Here is a plot that shows their locations.

I created the dataset. Using pandas and geopandas I plotted this figure; a sketch of the approach appears below.

It was good learning. I could set their colors differently, as I wanted.

That part was intended, but the map also shows an interesting pattern: most of the IITs are in northern India.
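A minimal sketch of how such a plot can be put together; the shapefile path and the sample coordinates here are placeholders, not my actual dataset:

import pandas as pd
import geopandas as gpd

# Hypothetical sample rows; my real dataset covered all the IITs.
iits = pd.DataFrame({
    'name': ['IIT Bombay', 'IIT Delhi'],
    'lon':  [72.92, 77.19],
    'lat':  [19.13, 28.55],
})
gdf = gpd.GeoDataFrame(iits, geometry=gpd.points_from_xy(iits.lon, iits.lat))

# 'india.shp' is a placeholder for an India boundary shapefile.
ax = gpd.read_file('india.shp').plot(color='lightgrey')
gdf.plot(ax=ax, color='red', markersize=40)  # each IIT as a coloured point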

Tuesday, July 21, 2020

Plot showing the district-wise population of Rajasthan, India, using geopandas

A geopandas chart showing the district-wise population of Rajasthan, India
I had the shapefile of the state of Rajasthan in India. The next step was to create a virtual environment for GeoPandas on my Anaconda distribution on a Windows laptop.

I did that. 

Once I had the setup and the data, I plotted the district-wise population of the state; a sketch of the core call appears below. GeoPandas made it really easy to plot the geo data and its associated properties.
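The core of such a choropleth is a single call. A minimal sketch, assuming a district-level shapefile with a population column (both names below are placeholders):

import geopandas as gpd

# Placeholder file and column names; the real shapefile was obtained separately.
raj = gpd.read_file('rajasthan_districts.shp')
raj.plot(column='population', legend=True, cmap='OrRd')  # shade districts by population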

I am looking forward to working on more geo Pandas plots. 


Sunday, July 19, 2020

World population map with GeoPandas



Today I had decided to make my first plot using GeoPandas, and I am glad I could do it.

Below is the plot of world population. But the plot itself is not that important.

What is important is how I got to this point. After attaining some momentum with pandas almost six months ago, I could not do much with GeoPandas. The reason was that when I installed GeoPandas, it also pulled in its dependencies and disturbed the existing packages in the Anaconda distribution on my Windows laptop. So I had to remove GeoPandas.

The solution, I understood, needed multiple separate environments. That is done by creating virtual environments. This was where I lost the momentum.

There are many options for creating virtual environments, and I needed to understand which was relevant for me. I finally figured that out today.

I opted to create the needed virtual environment using the Anaconda GUI and loaded the necessary packages. I was not clear about what happens to the IDE; it was also installed on its own.

I now hope to build on what I have learned today!
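For reference, a minimal sketch of a world map coloured by population, using the sample dataset that GeoPandas shipped with at the time:

import geopandas as gpd

# 'naturalearth_lowres' shipped with GeoPandas releases of that era.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.plot(column='pop_est', legend=True)  # shade countries by estimated population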


World population map with GeoPandas

Thursday, May 21, 2020

Ubuntu

The benefits of Ubuntu Linux distribution

I bought a new laptop with the Ubuntu distribution in September 2015. I had decided to learn the good, useful software packages available under it.

I learned Python 2.7 on it, from the beginning, in the second half of 2017. Then it was not used for a while, because I bought a Windows laptop to learn Python 3 and to use Excel, to which I added Power Query and Power BI.

Back to the Ubuntu distribution.

But I was determined to use my Ubuntu laptop, so during COVID-19 I started using the functionality available at the command prompt.

Soon I could copy and move files with simple commands at the prompt.

I had lots of files and photos on the internal drive, an external drive, a media card, and USB pen drives. I moved all of them into designated folders. The system has a GUI like Windows, but moving files with commands at the prompt is much easier. You can also set conditions, such as copying only newer or updated files (cp -u).

I had read about this, but now I have experienced it.

As a next step, I learned to write shell scripts to do the file management.

The next part was learning the text-processing abilities of Linux. Text processing becomes easy with grep, sed, and awk. Regular expressions are something one also has to learn in order to use these three commands.

With awk one can work on data laid out in rows and columns, analyse it, and prepare reports.

Here too you can write scripts to automate these tasks; I could write shell scripts around grep, sed, and awk.

I used pdfgrep, which I downloaded, to search inside PDF files.

I had to learn how to download and update my packages from the central servers using the command prompt. Learning to read the help pages is part of it.

The power and flexibility of these commands is immense. 

The next stage is to use it for real-world applications. In fact, I learned it only to work on real-world applications.

I am happy with my progress so far and hope to put it to some good use. 

Friday, May 1, 2020

State-wise electricity consumption - India

State-wise electricity consumption - India
I had prepared this report almost a month ago; I am posting it today. It was part of my learning of Power BI reports. The data was taken from a wiki page on the internet.

Maharashtra has the highest consumption of electricity, followed by Tamil Nadu and Gujarat, and then by Karnataka and central India (UP, MP, and Rajasthan).

If electricity consumption is a proxy for the progress of a state, then the chart clearly tells you the story.