Is it possible to normalize data in Excel so that I can ensure my data-set adheres to 3rd Normal Form when i import to Access

1

To give you a bit of background, I am a university student doing an assignment in Data Science. I have a dataset in Excel, based on football/soccer players from around the world. Currently, each player has an ID and each record includes their ID, name, club, photo, age, etc etc;

The trouble I am having is when I design the tables in Access, I cannot import the data from Excel AND normalize it due to some repeating data.

I have assigned each player an ID, and I want to break it up so that I have a table for (Club, Player, Player_Attributes, League,) but because I have over 17,000 players data and their club - I would have to manually assign each unique club an ID in Access then delete ALL repeating clubs from my excel spreadsheet so that only one instance of a club appears in the 'Club' table and a correlating 'clubID' column in the 'Player' table that links to the club table. Can I automatically delete all repeating instances of something in Excel AND ALSO replace that repetition with, let's say, a number?

Screenshot of my top 4 records I have within Excel. The spreadsheet contains over 17,000 records.

top 4 records I have within Excel
(Click image to enlarge)

MC123

Posted 2017-11-14T16:53:26.060

Reputation: 11

You may like to show some sample data of your input table, what transitive dependency you have identified and how you expect your output tables (that you would export to Access) look like. With some sample input and output tables, it may be easier to provide a possible solution. – patkim – 2017-11-14T20:15:28.640

Transitive dependancy - {player} -> {club} essentially, but I will be removing all transitive dependencies and they will be relational through the use of foreign key's. If you see my image, you will notice some of the statistics that I have for the players. The problem lies in the 'club' section, as well as 'League' and also 'Club logo' as multiple players can play for 1 given club, they share the same League, Club, and Club photo, but they are separate players. – MC123 – 2017-11-15T09:34:00.433

[Table1: Club] -> ClubID, ClubName, ClubLogo,LeagueID), [Table 2: League] -> LeagueID, LeagueName, [Table 3: Players] -> PlayerID, PlayerName,...,...,...,...,..,...,...,.. ClubID] [Table 4: PlayerAttributes] -> (PlayerAttID, PlayerID,...,...,...,...,...,...,..,..) (I have excluded non-relevant columns, there are too many to include here) – MC123 – 2017-11-15T09:42:28.663

Answers

0

Do it in steps:

Import into a 'raw data' table that represents the non-normalised data - essentially the Excel spreadsheet data verbatim (you don't need to create this table; the import wizard does it for you).

Extract Leagues, Clubs into their proper normalised tables by doing an Append query which selects distinct data from the raw table. E.g.:

INSERT INTO Leagues SELECT DISTINCT LeagueName FROM RawData.

Extract Players into the proper normalised player table. To do this, use an append query which joins the RawData table to the normalised League and Club tables, with the name as the join field (since you don't have the ID in the raw data). In this query SELECT the player normalised fields (name, age, etc) as well as the Club and League IDs. E.g.:

INSERT INTO Players
SELECT PlayerName, Age, LeagueID, ClubID
FROM RawData 
INNER JOIN Clubs ON RawData.ClubName = Clubs.ClubName
INNER JOIN Leagues ON RawData.LeagueName = Leagues.LeagueName

andrew

Posted 2017-11-14T16:53:26.060

Reputation: 208