So I had a pretty simple question: how many MLS players are participating in the World Cup? Surely, I muttered to myself, this data must be really easy to find.
Nope.
So I decided to play with Claude (the Anthropic “AI” – though I still hate using the term “intelligence” for these Stochastic Plagiarism Engines) to figure it out. The result was way, way, way better than I envisioned. I will confess that I was doing this while watching games, so a lot of my struggles were probably because I was only paying half attention to what I was doing.
But I learned a few things along the way.
True Facts about the World Cup
So I ended up “building” an app – honestly it felt more like managing a surly teenager than coding – that lets me play with the World Cup squad data, and I think the result is pretty amazing.
Here are some random things I’ve learned playing with the app:
- There are more players from the QSL in Qatar (29) than from Liga MX (26). Of those 26 Liga MX players, 12 play for Mexico, four each for Uruguay and Panama, and six for other national teams in the Americas.
- Qatar has the most players who play club soccer in their own country (25 of 26); Uruguay, DR Congo, Curaçao and the Ivory Coast are tied for the fewest (none).
- About twice as many players play in German leagues (110) as in US leagues (50); five players in the cup play professionally in the USL Championship. (This was surprising to me mostly because 4% of the players in the cup play in the US, which is quite a bit more than I expected. The MLS is coming, kids.) Also, about twice as many players play in the US as play in Mexico (26), and only 12 of those 26 play for the Mexican national team.
I could keep going, but you get the drift. If you’re into soccer,[1] you’ll probably enjoy playing with this thing.
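If you are wondering what kind of counting sits behind numbers like these, here is a minimal sketch. It is not the app's actual code. The row layout (player, national team, club country, league), the function names, and the sample rows are all assumptions made up for illustration; none of it is real squad data.

```python
from collections import Counter

# Hypothetical sample rows: (player, national_team, club_country, league).
# Placeholder names only, not real squad data.
ROWS = [
    ("Player A", "Mexico", "Mexico", "Liga MX"),
    ("Player B", "Mexico", "USA", "MLS"),
    ("Player C", "Uruguay", "Mexico", "Liga MX"),
    ("Player D", "Qatar", "Qatar", "QSL"),
    ("Player E", "Qatar", "Qatar", "QSL"),
    ("Player F", "Panama", "USA", "USL Championship"),
]


def players_by_league(rows):
    """Count how many players in the tournament play in each league."""
    return Counter(league for _, _, _, league in rows)


def home_based(rows):
    """Per national team, count players whose club is in that same country."""
    counts = {team: 0 for _, team, _, _ in rows}  # keep teams with zero
    for _, team, club_country, _ in rows:
        if team == club_country:
            counts[team] += 1
    return counts


def share_in_country(rows, country):
    """Fraction of all players whose club is in the given country."""
    in_country = sum(1 for _, _, club_country, _ in rows if club_country == country)
    return in_country / len(rows)


if __name__ == "__main__":
    print(players_by_league(ROWS).most_common())
    print(home_based(ROWS))
    print(f"{share_in_country(ROWS, 'USA'):.0%} of players play in the US")
```

Keeping teams with a zero count in `home_based` matters here: the "tied for the fewest (none)" result only shows up if teams with no home-based players stay in the output.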
Don’t Trust Data from Stochastic Plagiarism Engines.
Ever.
Because I was actually watching games, I decided the quickest route to answer my question was to just ask the plagiarism engines. So I told Claude to get the data and generate a way to explore it. Here is the prompt I used:
Can you create a description of where all the players in the 2026 World Cup play their club soccer? Please include a summary by region, and for clubs with more than two players a list by WC teams.
Wow, was the result bad. A couple of squads only had two players. Some of the players were on the wrong squad. A bunch of players were not playing anywhere. It was basically a big, steaming pile of inaccuracy. Which is actually an exaggeration; what it did was far worse. It was mostly accurate, with a few completely boneheaded errors that would be easy to miss if you weren’t wary enough.
I spent a while trying to convince Claude to clean its own mess:
Please recheck the club team for every player. Petar Musa no longer plays for Benfica. Marko Pjaca plays for the [Eredivisie](https://en.wikipedia.org/wiki/Eredivisie) club [Twente](https://en.wikipedia.org/wiki/FC_Twente).
Claude: “Oh! It looks like many of my squad entries were based on preliminary/historical data rather than the final June 2 submissions. The [sites I used] didn’t render all the data I wanted. Let me fix eight squads.”
Did you double check the rest of the squads you identified against that authoritative list?
Claude: “To answer your question honestly: no, I didn’t systematically check every squad against it. I fetched the page twice but the token limit cut off around Group F/G each time, so I only had confirmed [site to be damned later] data for Groups A–F.”
After a few hours trying to clean the data (and running into my token limit), I realized that I was on a fool’s errand, and decided to get the data myself.
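In hindsight, the structural errors (squads with too few players, players with no club, a player showing up in more than one squad) are the kind a few lines of checking can flag. Here is a rough sketch of what I mean. The field names (`team`, `player`, `club`) and the function name are hypothetical, and the expected squad size of 26 is an assumption based on the "25 of 26" figure above. It will not catch a plausible but wrong club, which was the more dangerous kind of error.

```python
from collections import defaultdict

# Assumption: full squads have 26 players (from the "25 of 26" figure above).
EXPECTED_SQUAD_SIZE = 26


def audit_squads(rows, expected_size=EXPECTED_SQUAD_SIZE):
    """Return a list of human-readable problems found in the squad rows.

    Each row is a dict with hypothetical keys "team", "player" and "club".
    """
    problems = []

    # Group rows by national team.
    squads = defaultdict(list)
    for row in rows:
        squads[row["team"]].append(row)

    for team, players in sorted(squads.items()):
        # Wrong squad size.
        if len(players) != expected_size:
            problems.append(
                f"{team}: {len(players)} players, expected {expected_size}"
            )
        # Missing or blank club.
        for p in players:
            if not (p.get("club") or "").strip():
                problems.append(f"{team}: {p['player']} has no club")

    # Same player name listed under more than one national team.
    # (Two different players can share a name, so treat this as a hint.)
    teams_by_player = defaultdict(set)
    for row in rows:
        teams_by_player[row["player"]].add(row["team"])
    for player, teams in sorted(teams_by_player.items()):
        if len(teams) > 1:
            problems.append(f"{player}: listed for {', '.join(sorted(teams))}")

    return problems


if __name__ == "__main__":
    # Tiny made-up sample, checked against a squad size of 2.
    sample = [
        {"team": "A", "player": "P1", "club": "X"},
        {"team": "A", "player": "P2", "club": ""},
        {"team": "B", "player": "P1", "club": "Y"},
    ]
    for problem in audit_squads(sample, expected_size=2):
        print(problem)
```

An empty list only means the data has the right shape. Whether each player's club is actually correct still has to be checked against a source you trust.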
Don’t Trust Data on the Web
So remember “site to be damned later” above? It sure seemed authoritative and complete, and Claude was pretty excited when it found it. So I grabbed the data directly off the site and manually cleaned it all up myself. Then I asked Claude to make it into a spreadsheet. What came out was attractive,[2] but when I inspected the data it was really bad. Many teams didn’t have enough players, the data for many players was incomplete, etc. It was plausible but very wrong.
So then I started wondering to myself, saying, “self, how does someone create a list that looks authoritative but is so very wrong? That’s what plagiarism engines do! Gosh, it’s almost like someone used… ohhhh.” Yep. When you look for this data, most of the sites you’ll find are created by these gibberish engines.
So I went searching for the data. It was a bit challenging to find (at least while also watching World Cup soccer games, albeit… less interesting games). Many sources provide the data, but tend to split it into a multi-page “click here to see the team you’re interested in” layout (along with a ton of ads and popovers and surveillance cookies). I found sources that showed the data on one page (along with a ton of ads and popovers and surveillance cookies), but most of them were similarly inaccurate. Finally I did what I probably should have done in the first place: I constrained the search to Wikipedia, and found what I needed.[3]
The Future is Garbage
So one takeaway from all this is that we will inevitably and inexorably find the internet polluted with half-baked data generated by these useful, powerful, and really untrustworthy plagiarism engines. Sites will generate polluted data, other sites will use that and pollute it further, and eventually it will be all garbage. Future’s so bright my kids talk about the inevitability of eco-terrorism.
But, in the end: Awesome Data
So I’m pretty sure the app is accurate now. But since I generated the app using Claude, there’s probably some hidden gibberish deep inside some squad – if you look at the app and see something inaccurate, please let me know.
Also, if there is additional data you’d like to see mixed in, I’d consider doing more work – but I’m not searching for more data. Apparently I’m not very good at it. So if you provide the data (along with the authoritative source you used to get it) I’ll consider adding it.
Have fun!
1. Shut up, England. You invented the term “soccer” in the first place.
2. I fixed all the data and regenerated the spreadsheet. You can download the (hopefully accurate) spreadsheet here, if you want to play with the data yourself. Everything I do is CC-BY-NC (reuse freely for non-commercial use, with attribution to me), so if you want to do something commercial you should probably go get the raw data (see footnote 3).
3. This morning I did what I really should have done first. I constrained the search to the FIFA site, and found what I really wanted – authoritative, clean, no ads, just the data. I hate to suggest FIFA does anything right, but that was ok.
