The data come from Kaggle: player statistics for the preliminary round of the 2020 FIVB (Fédération Internationale de Volleyball) tournament. I want to explore the relationship between players' statistics and their personal information. I chose this competition partly because my favorite team plays in the FIVB.
- Kaggle: FIVB - 2020 - Statistics - Preliminary Round
- The latest FIVB statistics on the international volleyball scene.
- You can also download this repository and read the `.csv` file I cleaned in RStudio:
read.csv("DataFrame/Players_stats2020.csv")
However, the problem is that this dataset only contains players' performance statistics; it doesn't store personal information such as position, height, weight, etc.
- Players' personal information
- You can also download this repository and read the `.csv` file I cleaned in RStudio.
- VolleyBox: profiles of volleyball players, including Facebook and other social media links
- but VolleyBox provides no API
- Difficulty: we couldn't use players' names directly to locate their pages on VolleyBox.
For instance, we want Yuji Nishida's page, but its URL is not "https://volleybox.net/yuji-nishida". After searching "Yuji Nishida" on VolleyBox, we found that the actual URL is "https://volleybox.net/yuji-nishida-p11693". Because of the numeric suffix, we cannot build a player's URL from the name alone, so we search Google for the page instead (see the code below).
read.csv("DataFrame/Player_infoVolleybox2023.csv")
library(rvest)  # also provides the %>% pipe and read_html()

plr = "Yuji Nishida"  # player's name (example)
# Search Google for the player's VolleyBox page
url = paste0("https://www.google.com/search?q=Volleybox+", gsub(" ", "+", plr))
page = read_html(url)
links = page %>%
  html_nodes("a") %>%  # get all the <a> nodes
  html_attr("href")    # extract their href attributes
# Strip Google's "/url?q=" redirect wrapper and the trailing "&..." parameters
link = gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
# Keep only links that look like VolleyBox player or club pages
link_volbox = Filter(function(u) any(grepl("https://volleybox.net/", u), grepl("-p", u), grepl("/clubs", u)), link)
plr_link = link_volbox[1]  # take the first match as the player's page
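Here Google serves as an index into VolleyBox, since the site itself has no API. For "Yuji Nishida", the first filtered link should be the profile URL "https://volleybox.net/yuji-nishida-p11693" mentioned above.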
The point is that you need to find where the information sits on the webpage (press F12 to inspect the HTML). In my project, the position, height, weight, birth date, spike, block, and dominant hand all live in `<dd class='info-data marginBottom10 '>` elements, so I use XPath to find them.
page = read_html(plr_link)
player_info = page %>%
  html_elements(xpath = "//*[@class = 'info-data marginBottom10 ']") %>%  # every matching <dd> element
  html_text()  # keep only the text inside each element
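The result, `player_info`, is a character vector of raw values. A minimal sketch of labeling them, assuming the `<dd>` elements appear in the same order as the fields listed above (verify this against the actual page before relying on it):

fields = c("position", "height", "weight", "birth", "spike", "block", "dominant_hand")  # assumed order
info = setNames(trimws(player_info[seq_along(fields)]), fields)
info["height"]  # look up a single field by name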
However, if you want to collect many pages like I did, you must be careful about the time between two `read_html` calls. If you hit the URLs too many times in a short period, the server will block you and your code will return "Error in open.connection(x, "rb") : HTTP error 429." I use the `Sys.sleep` function to control the time between connections.
Sys.sleep(1) #pause for 1 second
To make the pauses less predictable, I draw the sleep time at random:
Sys.sleep(runif(1, 5, 15))  # pause for a random 5 to 15 seconds
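Putting the pieces together, a sketch of scraping many players with randomized pauses; `plr_links` is assumed to be a character vector of the VolleyBox profile URLs collected earlier:

all_info = lapply(plr_links, function(u) {
  Sys.sleep(runif(1, 5, 15))  # random pause between requests to avoid HTTP error 429
  read_html(u) %>%
    html_elements(xpath = "//*[@class = 'info-data marginBottom10 ']") %>%
    html_text()
})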