The first step in processing this information was to combine all the files into one large Excel worksheet, more than 145,000 rows long, with each row indexed by the file number it came from and the line in that file, so that I could easily refer back to the original data. Each row was also given a unique record number. The original "raw data" has been uploaded with no further changes, and can be viewed by clicking on the Record Number (Rxxxx) provided with each entry.
The next step was identifying the individual players. In the original files, the first column holds a last name, with the corresponding first name in the second column (where one was known), and there may then be several rows with no name before another name appears. Each last name/first name pair marks an individual player, and all rows after that pair, up to the next pair, are associated with that player. Based on this pattern, the players were numbered, and subsequent entries were assigned to the preceding player. This generated a list of more than 37,000 players. Some have as few as one entry, while Mr. William Setley has 55 entries.
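The numbering pattern described above can be sketched in a few lines of Python. This is only an illustration, not the actual procedure (which was carried out in Excel); the dictionary keys "last" and "first" are hypothetical names for the first two columns, and the sketch assumes that any non-blank last name starts a new player.

```python
def assign_player_numbers(rows):
    """Return one player number per row.

    rows: list of dicts with hypothetical keys "last" and "first".
    Assumption: a non-blank "last" value starts a new player, and rows
    with a blank last name belong to the most recent player.
    Rows that appear before any name get player number 0.
    """
    player_number = 0
    numbers = []
    for row in rows:
        last = (row.get("last") or "").strip()
        if last:
            # A new last name/first name pair begins a new player.
            player_number += 1
        numbers.append(player_number)
    return numbers


rows = [
    {"last": "Setley", "first": "William"},
    {"last": "", "first": ""},
    {"last": "", "first": ""},
    {"last": "Example", "first": ""},  # made-up second player
    {"last": None, "first": None},
]
numbers = assign_player_numbers(rows)
```

Under this sketch, two consecutive rows that repeat the same name would be counted as two players, so repeated names would need a separate check.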
Players with major league records were identified in Baseball-Reference, which started the process of matching players in the database with known players in Baseball-Reference. As Year/Team/League entries were sorted, they could be compared with rosters for (known) clubs in Baseball-Reference and matched up, entry-by-entry, expanding the list of matched players. To date, roughly 50% of players in the database (most with debuts before 1900) have been matched. Work continues matching entries for players after 1900.
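As a rough illustration of the entry-by-entry comparison, the sketch below compares the names found in this database for one club and season against a reference roster for the same club. Both name lists are assumed to have been typed in or copied by hand; nothing here queries Baseball-Reference, and the function names and the simple name normalization are hypothetical. Real matching also has to cope with spelling variants and missing first names, which this does not attempt.

```python
def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def compare_rosters(database_names, reference_names):
    """Split names into matched, database-only, and reference-only groups."""
    db = {normalize(n) for n in database_names}
    ref = {normalize(n) for n in reference_names}
    return {
        "matched": sorted(db & ref),
        "database_only": sorted(db - ref),
        "reference_only": sorted(ref - db),
    }


# Made-up names, for illustration only.
result = compare_rosters(
    ["John  Doe", "Sam Roe"],
    ["john doe", "Al Poe"],
)
```

Names that land in the "database_only" group are the ones that still need a decision: they may be players who signed but never appeared, or players missing from an incomplete reference roster.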
While most entries include a Year/Team/League listing, some contain only biographical data in the final column. These entries turn up when viewing all records for a player, but do not show up in most of the other searches, which focus on the Year/Team/League entries. For any given club identified in Baseball-Reference before 1900, a roster can be generated from the database. These rosters may include players who were signed by a club but did not appear in a game with that club, as well as players who are not listed in Baseball-Reference for a club with an incomplete roster.
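Generating a roster amounts to grouping the Year/Team/League entries by club and season. A minimal sketch follows, assuming each entry is a dictionary with hypothetical keys "player", "year", "team", and "league"; entries that carry only biographical data are skipped, which mirrors why they do not show up in these searches.

```python
from collections import defaultdict


def build_rosters(entries):
    """Group player numbers by (year, team, league).

    Entries without a year or team (biographical-only entries)
    are left out of every roster.
    """
    rosters = defaultdict(set)
    for entry in entries:
        year = entry.get("year")
        team = entry.get("team")
        if year is None or not team:
            continue  # no Year/Team/League listing on this entry
        rosters[(year, team, entry.get("league"))].add(entry["player"])
    return dict(rosters)


# Made-up entries, for illustration only.
entries = [
    {"player": 1, "year": 1877, "team": "Club A", "league": "League X"},
    {"player": 2, "year": 1877, "team": "Club A", "league": "League X"},
    {"player": 2, "year": 1878, "team": "Club B", "league": "League X"},
    {"player": 3, "year": None, "team": "", "league": ""},  # biography only
]
rosters = build_rosters(entries)
```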
The last field in each entry includes references for the entry, biographical notes on the player, and other miscellaneous information. I hope to clean this up to make this information searchable, but that is a slow process. To access this information for now, look at the full records for a player.
The original datafiles were created in Apple's Numbers, Apple's counterpart to Microsoft Excel. Numbers and Excel handle dates differently, and in transferring the data, a fraction of the dates were interpreted by Excel as numbers. Thus "-4/6" became -0.666666667 in Excel. Others were turned into integers (related to how Excel stores dates internally, as serial day counts). While some of these have been cleared up, only a comparison with the original datafile in Numbers will be able to resolve some of these issues.
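The sketch below illustrates both kinds of damage. It assumes Excel's default 1900 date system, in which a serial number counts days from a base of 1899-12-30 (valid for serials of 61 and above). It also assumes, purely for illustration, that a damaged fraction started life as a small number over a small number (numerator 1 to 12, denominator 1 to 31). The function names are hypothetical.

```python
from datetime import date, timedelta

# Base date for Excel's 1900 date system (correct for serials >= 61).
EXCEL_BASE = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert an Excel serial day count back to a calendar date."""
    return EXCEL_BASE + timedelta(days=int(serial))


def candidate_fractions(value, tol=1e-8):
    """List (numerator, denominator) pairs whose quotient matches abs(value).

    Assumption: numerator in 1..12 and denominator in 1..31.
    """
    target = abs(value)
    return [
        (n, d)
        for n in range(1, 13)
        for d in range(1, 32)
        if abs(n / d - target) < tol
    ]


candidates = candidate_fractions(-0.666666667)
```

For -0.666666667 the candidates include (4, 6), but also (2, 3), (6, 9), and several others, which is why the stored number alone cannot say what was originally typed and a comparison with the Numbers file is needed.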
During processing, some errors have been corrected (for example, a year entered as 1977 instead of 1877). Some records were removed entirely (blank lines or duplicates). Some discrepancies between Baseball-Reference and this dataset were resolved based on newspaper records or other sources, while others were left alone. Processing continues on records after 1900. This dataset does not claim to be 100% accurate or 100% complete. It is meant to be a resource to point researchers in directions they might not have otherwise looked. To the degree that I can update it, I will. Records that I add will not have a link back to the original data.
In addition to the 65 files of player data, there are numerous files of data for individual leagues and seasons. I hope to link those into this database eventually, but I have no clue how that is going to look.