To scrape the web,¶

to get the table,¶

of Economic Times¶

2019 Top 10 Companies¶

Let us download,
the necessary libraries.

from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import lxml
import pandas as pd
from tabulate import tabulate

The link address is:link

Using URL open and
reading the URL using BeautifulSoup failed.
The error was
HTTP Error: Forbidden

So I tried the following.

site= "https://economictimes.indiatimes.com/et500"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)

Let us test

soup.title

The output was:

<title>ET 500 Company List - List of Top Companies in India 2019 | Economic Times ET500</title>

It worked.

*Find table rows

rows = soup.find_all('tr')
print(rows[:3])

[<tr class="table_head"><th colspan="2">Rank</th><th rowspan="2">Company Name</th><th rowspan="2">Market Cap <br/> (Rs. Cr)</th><th rowspan="2">Revenue <br/> Change%</th><th rowspan="2">PAT <br/> Change%</th></tr>, <tr class="table_head"><th>2019</th><th>2018</th></tr>, <tr class="data light"><td class="Rnk1 textC">1</td><td class="Rnk2 textC">2</td><td class="comp"><a class="etcmpnylist" href="/stocks/companyid-13215.cms" target="_blank" title="Reliance Industries Ltd.">Reliance Industries Ltd.</a></td><td class="textR">949,280.41</td><td class="textR">43.36</td><td class="textR">9.74</td></tr>]

From Table rows let us look at table data items

list_rows = []
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells,'lxml').getText()
    list_rows.append(cleantext) # WORKS

list_rows

['[]',
 '[]',
 '[1, 2, Reliance Industries Ltd., 949,280.41, 43.36, 9.74]',
 '[2, 1, Indian Oil Corporation Ltd., 125,768.47, 24.99, -21.69]',
 '[3, 3, Oil And Natural Gas Corporation Ltd., 172,135.97, 29.97, 37.95]',
 '[4, 4, State Bank of India, 290,140.98, 7.88, LP]',
 '[5, 5, Tata Motors Ltd., 48,588.33, 2.06, PL]',
 '[6, 6, Bharat Petroleum Corporation Ltd., 111,550.97, 26.10, -13.39]',
 '[7, 7, Hindustan Petroleum Corporation Ltd., 45,237.26, 24.94, -7.31]',
 '[8, 8, Rajesh Exports Ltd., 20,200.00, -6.35, 2.08]',
 '[9, 9, Tata Steel Ltd., 45,758.52, 15.42, -23.94]',
 '[10, 10, Coal India Ltd., 126,223.50, 14.04, 148.09]']

Now let us prpare the dataframe

df = pd.DataFrame(list_rows)

df

df1 = df.loc[2:]

df2 = df1[0].str.split(',',expand=True)

df2[0] = df2[0].str.strip('[')
df2[6] = df2[6].str.strip(']')
print(df2.head(3))

df2[7]=df2[3]+df2[4] # Join 
df3 = df2.drop([3,4],axis=1)
df3 = df3[[0,1,2,7,5,6]]

The column containing market capitalisation had a ',' in its numbers.

While splitting the data frame df1, the column with market capitalization also split inadvertently.

The number was split in to two columns 3 & 4.

So the two columns were joined in col no 7.

Later dropped columns 3 and 4. And rearranged columns sequence in df3.

df3.head(3)

df3_md = df3.to_markdown()

print(df3_md)

|    |   0 |   1 | 2                                    |        7 |     5 | 6      |
|---:|----:|----:|:-------------------------------------|---------:|------:|:-------|
|  2 |   1 |   2 | Reliance Industries Ltd.             | 949280   | 43.36 | 9.74   |
|  3 |   2 |   1 | Indian Oil Corporation Ltd.          | 125768   | 24.99 | -21.69 |
|  4 |   3 |   3 | Oil And Natural Gas Corporation Ltd. | 172136   | 29.97 | 37.95  |
|  5 |   4 |   4 | State Bank of India                  | 290141   |  7.88 | LP     |
|  6 |   5 |   5 | Tata Motors Ltd.                     |  48588.3 |  2.06 | PL     |
|  7 |   6 |   6 | Bharat Petroleum Corporation Ltd.    | 111551   | 26.1  | -13.39 |
|  8 |   7 |   7 | Hindustan Petroleum Corporation Ltd. |  45237.3 | 24.94 | -7.31  |
|  9 |   8 |   8 | Rajesh Exports Ltd.                  |  20200   | -6.35 | 2.08   |
| 10 |   9 |   9 | Tata Steel Ltd.                      |  45758.5 | 15.42 | -23.94 |
| 11 |  10 |  10 | Coal India Ltd.                      | 126224   | 14.04 | 148.09 |

Columns to be Renamed

df3.rename(columns = {0:'Rank2019',
           1:'Rank2018',
           2:' Company Name',
           7:' Market Cap  (Rs. Cr)',
           5:' Revenue  Change%',
           6:' PAT  Change%'},
           inplace = True)

df3.head(3)

Now let us get the table header.

col_labels = soup.find_all('th')
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str,'lxml').get_text()
all_header.append(cleantext2)

dfh1 = pd.DataFrame(all_header)    
dfh2 = dfh1[0].str.split(',',expand=True)

dfh2[0] = dfh2[0].str.strip('[')
dfh2[6] = dfh2[6].str.strip(']')

dfh_a = dfh2.drop([5,6],axis=1) 
dfh_a.columns = dfh_a.iloc[0] # Add column names. 
dfh_a.insert(1,'Rank2018','Rank2018',True) # Add a column. 
dfh_a.rename(columns = {'Rank' : 'Rank2019'}, inplace = True)

frames = [dfh_a,df3]
df4 = pd.concat(frames)

df5 = df4.drop(0)
df5

df5_md = df5.to_markdown()

print(df5_md)

|    |   Rank2019 |   Rank2018 |  Company Name                        |    Market Cap  (Rs. Cr) |    Revenue  Change% |  PAT  Change%   |
|---:|-----------:|-----------:|:-------------------------------------|------------------------:|--------------------:|:----------------|
|  2 |          1 |          2 | Reliance Industries Ltd.             |                949280   |               43.36 | 9.74            |
|  3 |          2 |          1 | Indian Oil Corporation Ltd.          |                125768   |               24.99 | -21.69          |
|  4 |          3 |          3 | Oil And Natural Gas Corporation Ltd. |                172136   |               29.97 | 37.95           |
|  5 |          4 |          4 | State Bank of India                  |                290141   |                7.88 | LP              |
|  6 |          5 |          5 | Tata Motors Ltd.                     |                 48588.3 |                2.06 | PL              |
|  7 |          6 |          6 | Bharat Petroleum Corporation Ltd.    |                111551   |               26.1  | -13.39          |
|  8 |          7 |          7 | Hindustan Petroleum Corporation Ltd. |                 45237.3 |               24.94 | -7.31           |
|  9 |          8 |          8 | Rajesh Exports Ltd.                  |                 20200   |               -6.35 | 2.08            |
| 10 |          9 |          9 | Tata Steel Ltd.                      |                 45758.5 |               15.42 | -23.94          |
| 11 |         10 |         10 | Coal India Ltd.                      |                126224   |               14.04 | 148.09          |

So we got the table we wanted

It may be done in more efficient ways, but this is my first such attempt on my own.

Sustainability

Pages

Monday, October 26, 2020

Economic Times Top 10 companies

To scrape the web,¶

to get the table,¶

of Economic Times¶

2019 Top 10 Companies¶

	Rank2019	Rank2018	Company Name	Market Cap (Rs. Cr)	Revenue Change%	PAT Change%
2	1	2	Reliance Industries Ltd.	949280.41	43.36	9.74
3	2	1	Indian Oil Corporation Ltd.	125768.47	24.99	-21.69
4	3	3	Oil And Natural Gas Corporation Ltd.	172135.97	29.97	37.95