- Let us download,
- the necessary libraries.
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import lxml
import pandas as pd
from tabulate import tabulate
- The link address is:link
Using URL open and
reading the URL using BeautifulSoup failed.
The error was
HTTP Error: Forbidden
- So I tried the following.
site= "https://economictimes.indiatimes.com/et500"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)
- Let us test
soup.title
The output was:
<title>ET 500 Company List - List of Top Companies in India 2019 | Economic Times ET500</title>
- It worked.
*Find table rows
rows = soup.find_all('tr')
print(rows[:3])
[<tr class="table_head"><th colspan="2">Rank</th><th rowspan="2">Company Name</th><th rowspan="2">Market Cap <br/> (Rs. Cr)</th><th rowspan="2">Revenue <br/> Change%</th><th rowspan="2">PAT <br/> Change%</th></tr>, <tr class="table_head"><th>2019</th><th>2018</th></tr>, <tr class="data light"><td class="Rnk1 textC">1</td><td class="Rnk2 textC">2</td><td class="comp"><a class="etcmpnylist" href="/stocks/companyid-13215.cms" target="_blank" title="Reliance Industries Ltd.">Reliance Industries Ltd.</a></td><td class="textR">949,280.41</td><td class="textR">43.36</td><td class="textR">9.74</td></tr>]
- From Table rows let us look at table data items
list_rows = []
for row in rows:
row_td = row.find_all('td')
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells,'lxml').getText()
list_rows.append(cleantext) # WORKS
list_rows
['[]',
'[]',
'[1, 2, Reliance Industries Ltd., 949,280.41, 43.36, 9.74]',
'[2, 1, Indian Oil Corporation Ltd., 125,768.47, 24.99, -21.69]',
'[3, 3, Oil And Natural Gas Corporation Ltd., 172,135.97, 29.97, 37.95]',
'[4, 4, State Bank of India, 290,140.98, 7.88, LP]',
'[5, 5, Tata Motors Ltd., 48,588.33, 2.06, PL]',
'[6, 6, Bharat Petroleum Corporation Ltd., 111,550.97, 26.10, -13.39]',
'[7, 7, Hindustan Petroleum Corporation Ltd., 45,237.26, 24.94, -7.31]',
'[8, 8, Rajesh Exports Ltd., 20,200.00, -6.35, 2.08]',
'[9, 9, Tata Steel Ltd., 45,758.52, 15.42, -23.94]',
'[10, 10, Coal India Ltd., 126,223.50, 14.04, 148.09]']
- Now let us prpare the dataframe
df = pd.DataFrame(list_rows)
df
df1 = df.loc[2:]
df2 = df1[0].str.split(',',expand=True)
df2[0] = df2[0].str.strip('[')
df2[6] = df2[6].str.strip(']')
print(df2.head(3))
df2[7]=df2[3]+df2[4] # Join
df3 = df2.drop([3,4],axis=1)
df3 = df3[[0,1,2,7,5,6]]
The column containing market capitalisation had a ',' in its numbers.
While splitting the data frame df1, the column with market capitalization also split inadvertently.
The number was split in to two columns 3 & 4.
So the two columns were joined in col no 7.
Later dropped columns 3 and 4. And rearranged columns sequence in df3.
df3.head(3)
df3_md = df3.to_markdown()
print(df3_md)
Columns to be Renamed
df3.rename(columns = {0:'Rank2019',
1:'Rank2018',
2:' Company Name',
7:' Market Cap (Rs. Cr)',
5:' Revenue Change%',
6:' PAT Change%'},
inplace = True)
df3.head(3)
Now let us get the table header.
col_labels = soup.find_all('th')
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str,'lxml').get_text()
all_header.append(cleantext2)
dfh1 = pd.DataFrame(all_header)
dfh2 = dfh1[0].str.split(',',expand=True)
dfh2[0] = dfh2[0].str.strip('[')
dfh2[6] = dfh2[6].str.strip(']')
dfh_a = dfh2.drop([5,6],axis=1)
dfh_a.columns = dfh_a.iloc[0] # Add column names.
dfh_a.insert(1,'Rank2018','Rank2018',True) # Add a column.
dfh_a.rename(columns = {'Rank' : 'Rank2019'}, inplace = True)
frames = [dfh_a,df3]
df4 = pd.concat(frames)
df5 = df4.drop(0)
df5
df5_md = df5.to_markdown()
print(df5_md)
So we got the table we wanted
It may be done in more efficient ways, but this is my first such attempt on my own.