Monday, October 26, 2020

Economic Times Top 10 companies

ET_Top_10_WS1

The aim is to scrape the web to get the Economic Times table of the 2019 Top 10 Companies.

  • Let us import the necessary libraries.
In [43]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import lxml
import pandas as pd
from tabulate import tabulate
  • The link address is: https://economictimes.indiatimes.com/et500

Opening the URL with urlopen and
reading it with BeautifulSoup failed.
The error was
HTTP Error 403: Forbidden

  • So I tried the following.
In [ ]:
site = "https://economictimes.indiatimes.com/et500"
hdr = {'User-Agent': 'Mozilla/5.0'}  # browser-like header, so the server accepts the request
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'lxml')  # name the parser explicitly
  • Let us test

soup.title

The output was:

<title>ET 500 Company List - List of Top Companies in India 2019 | Economic Times ET500</title>

  • It worked.
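
The request with a browser-like header can be wrapped in a small helper. This is only a minimal sketch, not the code used in this post: the names build_request and fetch_html are hypothetical, and the 30-second timeout is an assumption. The example builds the request and prints the header that would be sent; fetch_html is defined but not called here, so nothing is downloaded.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Fetch a page as bytes; return None if the server answers with an HTTP error."""
    try:
        with urlopen(build_request(url), timeout=30) as page:
            return page.read()
    except HTTPError as err:
        # A 403 here would mean the server still refuses the request.
        print(f"HTTP Error {err.code}: {err.reason}")
        return None


req = build_request("https://economictimes.indiatimes.com/et500")
# urllib stores header names with only the first letter capitalised.
print(req.get_header("User-agent"))
```

Catching HTTPError keeps the notebook cell from stopping with a traceback if the site refuses the request again.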

  • Find table rows

In [ ]:
rows = soup.find_all('tr')
print(rows[:3]) 

[<tr class="table_head"><th colspan="2">Rank</th><th rowspan="2">Company Name</th><th rowspan="2">Market Cap <br/> (Rs. Cr)</th><th rowspan="2">Revenue <br/> Change%</th><th rowspan="2">PAT <br/> Change%</th></tr>, <tr class="table_head"><th>2019</th><th>2018</th></tr>, <tr class="data light"><td class="Rnk1 textC">1</td><td class="Rnk2 textC">2</td><td class="comp"><a class="etcmpnylist" href="/stocks/companyid-13215.cms" target="_blank" title="Reliance Industries Ltd.">Reliance Industries Ltd.</a></td><td class="textR">949,280.41</td><td class="textR">43.36</td><td class="textR">9.74</td></tr>]

  • From the table rows, let us extract the table data items
In [ ]:
list_rows = []
for row in rows:
    row_td = row.find_all('td')  # data cells; empty for the two header rows
    str_cells = str(row_td)  # the list of tags as one HTML string
    cleantext = BeautifulSoup(str_cells, 'lxml').get_text()  # strip the tags
    list_rows.append(cleantext)

list_rows

['[]',
 '[]',
 '[1, 2, Reliance Industries Ltd., 949,280.41, 43.36, 9.74]',
 '[2, 1, Indian Oil Corporation Ltd., 125,768.47, 24.99, -21.69]',
 '[3, 3, Oil And Natural Gas Corporation Ltd., 172,135.97, 29.97, 37.95]',
 '[4, 4, State Bank of India, 290,140.98, 7.88, LP]',
 '[5, 5, Tata Motors Ltd., 48,588.33, 2.06, PL]',
 '[6, 6, Bharat Petroleum Corporation Ltd., 111,550.97, 26.10, -13.39]',
 '[7, 7, Hindustan Petroleum Corporation Ltd., 45,237.26, 24.94, -7.31]',
 '[8, 8, Rajesh Exports Ltd., 20,200.00, -6.35, 2.08]',
 '[9, 9, Tata Steel Ltd., 45,758.52, 15.42, -23.94]',
 '[10, 10, Coal India Ltd., 126,223.50, 14.04, 148.09]']
  • Now let us prepare the dataframe
In [ ]:
df = pd.DataFrame(list_rows)
In [ ]:
df
In [ ]:
df1 = df.loc[2:]
In [ ]:
df2 = df1[0].str.split(',',expand=True)
In [ ]:
df2[0] = df2[0].str.strip('[')
df2[6] = df2[6].str.strip(']')
print(df2.head(3))
In [ ]:
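df2[7] = df2[3] + df2[4]  # rejoin the two halves of the market cap
df3 = df2.drop([3, 4], axis=1)  # drop the two halves
df3 = df3[[0, 1, 2, 7, 5, 6]]  # restore the original column order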

The column containing market capitalisation has a ',' as the thousands separator in its numbers.

So, when the data frame df1 was split on ',', each market capitalisation value was also split inadvertently, into columns 3 and 4.

The two halves were therefore joined in a new column 7.

Columns 3 and 4 were then dropped, and the columns were put back in their original sequence in df3.
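
One possible way to avoid the split-and-rejoin step is to split on a comma followed by a space, because the cells are separated by ', ' while the thousands separator has no space after it. The sketch below uses plain Python on one of the row strings shown earlier. The function name parse_row is hypothetical, and the approach assumes that no company name contains ', '.

```python
def parse_row(text):
    """Split one bracketed row string into its six fields."""
    fields = text.strip("[]").split(", ")  # comma + space separates the cells
    rank_2019, rank_2018, name, market_cap, revenue, pat = fields
    # Drop the thousands separator so the market cap can become a number.
    return [rank_2019, rank_2018, name,
            float(market_cap.replace(",", "")), revenue, pat]


sample = "[1, 2, Reliance Industries Ltd., 949,280.41, 43.36, 9.74]"
print(parse_row(sample))
# ['1', '2', 'Reliance Industries Ltd.', 949280.41, '43.36', '9.74']
```

In pandas the same idea would be df1[0].str.split(', ', expand=True), which should give six columns directly instead of seven.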

In [ ]:
df3.head(3)
In [ ]:
df3_md = df3.to_markdown()
In [42]:
print(df3_md)
|    |   0 |   1 | 2                                    |        7 |     5 | 6      |
|---:|----:|----:|:-------------------------------------|---------:|------:|:-------|
|  2 |   1 |   2 | Reliance Industries Ltd.             | 949280   | 43.36 | 9.74   |
|  3 |   2 |   1 | Indian Oil Corporation Ltd.          | 125768   | 24.99 | -21.69 |
|  4 |   3 |   3 | Oil And Natural Gas Corporation Ltd. | 172136   | 29.97 | 37.95  |
|  5 |   4 |   4 | State Bank of India                  | 290141   |  7.88 | LP     |
|  6 |   5 |   5 | Tata Motors Ltd.                     |  48588.3 |  2.06 | PL     |
|  7 |   6 |   6 | Bharat Petroleum Corporation Ltd.    | 111551   | 26.1  | -13.39 |
|  8 |   7 |   7 | Hindustan Petroleum Corporation Ltd. |  45237.3 | 24.94 | -7.31  |
|  9 |   8 |   8 | Rajesh Exports Ltd.                  |  20200   | -6.35 | 2.08   |
| 10 |   9 |   9 | Tata Steel Ltd.                      |  45758.5 | 15.42 | -23.94 |
| 11 |  10 |  10 | Coal India Ltd.                      | 126224   | 14.04 | 148.09 |

Columns to be Renamed

In [ ]:
df3.rename(columns = {0:'Rank2019',
           1:'Rank2018',
           2:' Company Name',
           7:' Market Cap  (Rs. Cr)',
           5:' Revenue  Change%',
           6:' PAT  Change%'},
           inplace = True)
In [40]:
df3.head(3)
Out[40]:
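|    | Rank2019 | Rank2018 | Company Name                         | Market Cap (Rs. Cr) | Revenue Change% | PAT Change% |
|---:|---------:|---------:|:-------------------------------------|--------------------:|----------------:|------------:|
|  2 |        1 |        2 | Reliance Industries Ltd.             |           949280.41 |           43.36 |        9.74 |
|  3 |        2 |        1 | Indian Oil Corporation Ltd.          |           125768.47 |           24.99 |      -21.69 |
|  4 |        3 |        3 | Oil And Natural Gas Corporation Ltd. |           172135.97 |           29.97 |       37.95 |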

Now let us get the table header.

In [ ]:
col_labels = soup.find_all('th')
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str,'lxml').get_text()
all_header.append(cleantext2)
In [ ]:
dfh1 = pd.DataFrame(all_header)    
dfh2 = dfh1[0].str.split(',',expand=True)
In [ ]:
dfh2[0] = dfh2[0].str.strip('[')
dfh2[6] = dfh2[6].str.strip(']')
In [ ]:
dfh_a = dfh2.drop([5,6],axis=1) 
dfh_a.columns = dfh_a.iloc[0] # Add column names. 
dfh_a.insert(1,'Rank2018','Rank2018',True) # Add a column. 
dfh_a.rename(columns = {'Rank' : 'Rank2019'}, inplace = True)
In [ ]:
frames = [dfh_a,df3]
df4 = pd.concat(frames)
In [39]:
df5 = df4.drop(0)
df5
Out[39]:
|    | Rank2019 | Rank2018 | Company Name                         | Market Cap (Rs. Cr) | Revenue Change% | PAT Change% |
|---:|---------:|---------:|:-------------------------------------|--------------------:|----------------:|------------:|
|  2 |        1 |        2 | Reliance Industries Ltd.             |           949280.41 |           43.36 |        9.74 |
|  3 |        2 |        1 | Indian Oil Corporation Ltd.          |           125768.47 |           24.99 |      -21.69 |
|  4 |        3 |        3 | Oil And Natural Gas Corporation Ltd. |           172135.97 |           29.97 |       37.95 |
|  5 |        4 |        4 | State Bank of India                  |           290140.98 |            7.88 |          LP |
|  6 |        5 |        5 | Tata Motors Ltd.                     |            48588.33 |            2.06 |          PL |
|  7 |        6 |        6 | Bharat Petroleum Corporation Ltd.    |           111550.97 |           26.10 |      -13.39 |
|  8 |        7 |        7 | Hindustan Petroleum Corporation Ltd. |            45237.26 |           24.94 |       -7.31 |
|  9 |        8 |        8 | Rajesh Exports Ltd.                  |            20200.00 |           -6.35 |        2.08 |
| 10 |        9 |        9 | Tata Steel Ltd.                      |            45758.52 |           15.42 |      -23.94 |
| 11 |       10 |       10 | Coal India Ltd.                      |           126223.50 |           14.04 |      148.09 |
In [ ]:
df5_md = df5.to_markdown()
In [41]:
print(df5_md)
|    |   Rank2019 |   Rank2018 |  Company Name                        |    Market Cap  (Rs. Cr) |    Revenue  Change% |  PAT  Change%   |
|---:|-----------:|-----------:|:-------------------------------------|------------------------:|--------------------:|:----------------|
|  2 |          1 |          2 | Reliance Industries Ltd.             |                949280   |               43.36 | 9.74            |
|  3 |          2 |          1 | Indian Oil Corporation Ltd.          |                125768   |               24.99 | -21.69          |
|  4 |          3 |          3 | Oil And Natural Gas Corporation Ltd. |                172136   |               29.97 | 37.95           |
|  5 |          4 |          4 | State Bank of India                  |                290141   |                7.88 | LP              |
|  6 |          5 |          5 | Tata Motors Ltd.                     |                 48588.3 |                2.06 | PL              |
|  7 |          6 |          6 | Bharat Petroleum Corporation Ltd.    |                111551   |               26.1  | -13.39          |
|  8 |          7 |          7 | Hindustan Petroleum Corporation Ltd. |                 45237.3 |               24.94 | -7.31           |
|  9 |          8 |          8 | Rajesh Exports Ltd.                  |                 20200   |               -6.35 | 2.08            |
| 10 |          9 |          9 | Tata Steel Ltd.                      |                 45758.5 |               15.42 | -23.94          |
| 11 |         10 |         10 | Coal India Ltd.                      |                126224   |               14.04 | 148.09          |

So we got the table we wanted.

It may be done in more efficient ways, but this is my first such attempt on my own.
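
As one example of a shorter route, the text of each cell can be read directly instead of converting the whole row to a string and splitting it again. The sketch below is an assumption-laden illustration, not the method used above. It parses a small hand-made copy of the table: the first data row is the markup printed earlier, and the second row is simplified markup built from the values in list_rows. It uses Python's built-in 'html.parser' instead of lxml, and the column names are typed in by hand.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Small stand-in for the fetched page (see the assumptions above).
html = """
<table>
<tr class="table_head"><th colspan="2">Rank</th><th rowspan="2">Company Name</th></tr>
<tr class="table_head"><th>2019</th><th>2018</th></tr>
<tr class="data light"><td class="Rnk1 textC">1</td><td class="Rnk2 textC">2</td><td class="comp"><a class="etcmpnylist" href="/stocks/companyid-13215.cms" target="_blank" title="Reliance Industries Ltd.">Reliance Industries Ltd.</a></td><td class="textR">949,280.41</td><td class="textR">43.36</td><td class="textR">9.74</td></tr>
<tr><td>2</td><td>1</td><td>Indian Oil Corporation Ltd.</td><td>125,768.47</td><td>24.99</td><td>-21.69</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr"):
    # One clean string per <td>, so commas inside a cell cause no trouble.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header rows have no <td> cells
        records.append(cells)

columns = ["Rank2019", "Rank2018", "Company Name",
           "Market Cap (Rs. Cr)", "Revenue Change%", "PAT Change%"]
df = pd.DataFrame(records, columns=columns)

# Remove the thousands separator and store the market cap as a number.
cap = "Market Cap (Rs. Cr)"
df[cap] = df[cap].str.replace(",", "", regex=False).astype(float)
print(df)
```

With the real page, soup would come from the request made earlier, and the loop and the DataFrame step would stay the same.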

Sunday, October 18, 2020

Learning Python


I started my journey of learning Python with a book by Eric Matthes. The name of the book is Python Crash Course. It gave me the necessary start. When I was stuck, I sent him a query, and he responded immediately.
Later I found the Python Data Science Handbook very useful. It is written by Jake VanderPlas.
To learn pandas, I bought a book by Daniel Y. Chen. The name of the book is Pandas for Everyone.
Recently I found the articles by Kimberley Fessel to be very useful. I learned about her excellent articles through a podcast. If I am not mistaken, I heard Daniel Y. Chen on the same podcast channel.
All of the above are very well known. But still I hope this information helps.

Saturday, July 25, 2020

Where are IITs located?

With new Indian Institutes of Technology (IITs) being set up, I was keen to know their locations.

Here is a plot that shows their locations.

I created the dataset. Using pandas and geopandas, I plotted this figure.

It was good learning. I could set their colors differently as I wanted. 

This was not intended, but the map shows an interesting pattern. Most of the IITs are in Northern India.

Tuesday, July 21, 2020

Plot showing the population of Rajasthan, India, district-wise using GeoPandas

A geopandas chart showing population of Rajasthan India district wise
I had the shapefile of the state of Rajasthan in India. The next step was to create a virtual environment for GeoPandas in the Anaconda distribution on my Windows laptop.

I did that.

Once I had the setup and the data, I plotted the population of the state. GeoPandas made it really easy to plot the geo data and its associated properties.

I am looking forward to working on more GeoPandas plots.


Sunday, July 19, 2020

World population map with GeoPandas


Today I had decided to make my first plot using GeoPandas, and I am glad I could do it today.

Below is the plot of world population. But the plot is not that important.

What is important is how I got to this point. Although I had gained some momentum with pandas almost six months ago, I could not do much with GeoPandas. The reason was that when I installed GeoPandas, it also installed its dependencies and disturbed the existing packages in the Anaconda distribution on my Windows laptop. So I had to remove GeoPandas.

The solution, I understood, was to keep multiple separate environments. That is done by creating virtual environments. This was where I lost momentum.

There are many options for creating virtual environments, and I needed to understand which one was relevant for me. Finally I figured that out today.

I opted to create the needed virtual environment using the Anaconda GUI, and I installed the necessary packages in it. I was not clear what would happen to the IDE. It got installed on its own as well.

I now hope to build on what I have learned today!


World population map with GeoPandas