Trying to make a universal rating scale for Go

I have complained before about how little overlap there is between professional and amateur Go players. This is not just a lack of interaction between the two groups: there isn't even a universal rating system that covers all players. Instead there is one system for amateurs and another for professionals, and the two aren't comparable. The professional system doesn't even measure playing strength. Professional ranks are instead awarded for somewhat arbitrary achievements (some of them unrelated to actual play, such as honorary awards) and are almost never taken away.

I would love to have a universal rating scale like FIDE's Elo system. I applaud goratings.org's attempt at creating a numerical rating scale for professionals that measures actual playing strength. But the site doesn't rate the vast majority of amateurs, so it's still not universal. Some of the players on goratings.org also appear in the EGD (European Go Database), so I could try to glue the two together using these players.

I searched for a few suitable players and tried to fit a linear regression between their goratings.org rating and their EGD rating. First I used the following data points:

player	goratings.org rating	EGD rating	date
Mateusz Surma	2918	2667	2015-12-17
	2909	2720	2016-10-31
	2909	2720	2016-11-04
Pavol Lisy	2906	2652	2013-12-13
	2904	2721	2014-10-23
Ilya Shikshin	2864	2744	2013-12-16
	2934	2769	2013-09-03
	2941	2787	2017-06-11
	2953	2793	2017-10-13
	2956	2797	2017-12-12
	2966	2802	2018-06-13
	2971	2812	2019-10-10
Hans Pietsch	3048	2704	1997-06-27
Catalin Taranu	2840	2816	2002-06-27
	2838	2816	2002-10-24
	2838	2816	2002-11-14
Alexander Dinerchtein	3085	2740	2003-04-21
	3086	2740	2003-06-17
Ryan Li	3058	2771	2016-03-01
	3069	2771	2017-06-19
	3069	2771	2017-06-21
Ali Jabarin	2913	2692	2014-10-27
Artem Kachanovskyi	2908	2718	2016-10-31
	2908	2713	2016-10-02
	2908	2758	2018-06-16
Benjamin Lockhart	2866	2693	2016-06-06
Fernando Aguilar	2912	2696	2002-03-19
	2911	2696	2002-09-02
Antti Tormanen	2945	2707	2016-07-28
	2945	2707	2017-06-15
	2968	2707	2022-12-19
	2970	2707	2023-06-29
Fan Hui	3021	2812	2013-12-13
	3028	2807	2014-12-13
Andrii Kravets	2828	2682	2016-06-05
Stanislaw Frejlak	2840	2704	2021-06-08

The dates are from when these players defeated Asian professionals. I hoped that around that time, the players' goratings.org ratings would be relatively accurate. The rating can only be accurate for players who have both won and lost games, ideally not years apart. Unfortunately, the fit was quite bad: I got a correlation of 0.1886 and a p-value of 0.2707, i.e. not statistically distinguishable from a random cloud of data points. The 95% confidence interval for the slope is [-0.1001, 0.3457], which is too wide to be useful. See the scatter plot here:
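These numbers can be checked with a few lines of code. The following is a minimal sketch, not necessarily what the calculator does internally: it fits an ordinary least squares line through the 36 rows of the table (x = goratings.org rating, y = EGD rating) and reports the correlation, the slope and a 95% confidence interval for the slope. The function name `fit_line` is made up for this example, and the critical value 2.032 (Student's t, 34 degrees of freedom) is hardcoded as an assumption rather than computed.

```python
import math

# (goratings.org rating, EGD rating) pairs, copied row by row from the table.
DATA = [
    (2918, 2667), (2909, 2720), (2909, 2720),                # Mateusz Surma
    (2906, 2652), (2904, 2721),                              # Pavol Lisy
    (2864, 2744), (2934, 2769), (2941, 2787), (2953, 2793),
    (2956, 2797), (2966, 2802), (2971, 2812),                # Ilya Shikshin
    (3048, 2704),                                            # Hans Pietsch
    (2840, 2816), (2838, 2816), (2838, 2816),                # Catalin Taranu
    (3085, 2740), (3086, 2740),                              # Alexander Dinerchtein
    (3058, 2771), (3069, 2771), (3069, 2771),                # Ryan Li
    (2913, 2692),                                            # Ali Jabarin
    (2908, 2718), (2908, 2713), (2908, 2758),                # Artem Kachanovskyi
    (2866, 2693),                                            # Benjamin Lockhart
    (2912, 2696), (2911, 2696),                              # Fernando Aguilar
    (2945, 2707), (2945, 2707), (2968, 2707), (2970, 2707),  # Antti Tormanen
    (3021, 2812), (3028, 2807),                              # Fan Hui
    (2828, 2682),                                            # Andrii Kravets
    (2840, 2704),                                            # Stanislaw Frejlak
]

# Assumption: two-sided 95% Student's t quantile for 36 - 2 = 34 degrees of freedom.
T_CRIT_95_DF34 = 2.032


def fit_line(points, t_crit):
    """Ordinary least squares fit y = intercept + slope * x.

    Returns (r, slope, intercept, (ci_low, ci_high)), where the interval
    is slope +/- t_crit * (standard error of the slope).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares (clamped at 0 against rounding), then the
    # standard error of the slope.
    rss = max(syy - slope * sxy, 0.0)
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return r, slope, intercept, (slope - t_crit * se_slope, slope + t_crit * se_slope)


r, slope, intercept, (ci_low, ci_high) = fit_line(DATA, T_CRIT_95_DF34)
print(f"n = {len(DATA)}, r = {r:.4f}, slope = {slope:.4f}")
print(f"95% CI for the slope: [{ci_low:.4f}, {ci_high:.4f}]")
```

With the table's data this should land close to the values quoted above (a correlation of about 0.19 and a slope interval from roughly -0.10 to 0.35). The interval straddles 0, which is exactly why this first fit is useless for gluing the two scales together.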

I tried again with other data points. I chose just 5 players (Mateusz Surma, Alexander Dinerchtein, Ilya Shikshin, Ali Jabarin, Artem Kachanovskyi) who have more than one recorded win and loss on goratings.org, spread over many years, and who also have a solidly established EGD rating spanning many years. The hope is that these players' ratings are closer to their true strength. I used more than a dozen data points for each of those players and calculated the linear regression with the same calculator. The best fit is EGD rating = 1888 + 0.2919 * (goratings.org rating), but the correlation is still only 0.4572. At least the p-value is 2.109 * 10^-15, so we know the correlation is real. The 95% confidence interval for the slope is [0.2237, 0.36], which is still too wide for my taste. At least the best-fit slope is close to 0.333, which is what we would expect from the traditional relationship between amateur and professional ranks, namely that a 9-dan pro can give a 3-stone handicap to a 1-dan pro. But that could be a coincidence: I don't know whether the goratings.org rating was intended to have a 100-point difference between successive professional ranks.
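To make the width of that interval concrete, here is a small sketch that applies the best-fit line and shows how much the slope uncertainty alone changes the EGD gap predicted between two players. The function names are invented for this example and the coefficients are simply the ones quoted above. It ignores the uncertainty of the intercept, so it understates the real error.

```python
INTERCEPT = 1888.0                     # best-fit intercept quoted above
SLOPE = 0.2919                         # best-fit slope quoted above
SLOPE_LOW, SLOPE_HIGH = 0.2237, 0.36   # 95% confidence interval for the slope


def egd_from_goratings(rating):
    """Point estimate of the EGD rating for a given goratings.org rating."""
    return INTERCEPT + SLOPE * rating


def egd_gap(goratings_gap):
    """EGD rating gap implied by a goratings.org gap, as (low, best fit, high)."""
    return (
        SLOPE_LOW * goratings_gap,
        SLOPE * goratings_gap,
        SLOPE_HIGH * goratings_gap,
    )


print(f"goratings.org 3000 -> EGD {egd_from_goratings(3000):.1f}")
low, best, high = egd_gap(300)
print(f"300 goratings.org points -> {low:.1f} to {high:.1f} EGD points (best fit {best:.1f})")
```

A 300-point gap on goratings.org comes out anywhere between about 67 and 108 EGD points, i.e. somewhere between two thirds of an amateur rank and a full one. The fit is also only based on goratings.org ratings between roughly 2800 and 3100, so applying it far outside that range is pure extrapolation.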

The low certainty is caused by all the players being so close in rating and by the goratings.org ratings being based on so few data points. A better regression would require some of the weakest Asian pros to play many EGD-rated games, until they have a well-established EGD rating. That is unlikely to happen. The alternative would be for one of the European pros to become strong enough to defeat strong Asian pros regularly. I don't have much hope for that happening either.

There are of course far more Asian pros in the EGD, but their EGD ratings can't be used. The problem is that the EGD rating of strong players changes only very slowly and only once per event. And the initial EGD rating is essentially arbitrary, being based on the rank the player claims (which sometimes has little relation to actual playing strength). That means it can take dozens of events for the rating to converge to its true value. Just look at this for example:

The rating rarely changes by more than 10 points per event. That means that for the EGD rating to fall by 100 points (the difference between 5 dan and 6 dan), the player would have to play at least 10 events, all resulting in near-maximal rating losses. In reality it took Guo Juan almost 80 events to lose just 60 rating points. An Asian pro recorded for just 1 or 2 events can't possibly have an accurate EGD rating because of this slow change.
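The arithmetic behind this can be written down in a few lines. This is only a back-of-the-envelope sketch: the cap of 10 points per event and Guo Juan's 60 points over 80 events are the rough figures given above, and the function name is made up for this example.

```python
import math


def events_needed(rating_gap, points_per_event):
    """Number of events needed to move a rating by rating_gap points,
    if every event moves it by points_per_event in the right direction."""
    return math.ceil(rating_gap / points_per_event)


# Best case: every single event yields the (rarely exceeded) 10-point change.
best_case = events_needed(100, 10)

# Observed pace: about 60 points lost over about 80 events.
observed_pace = 60 / 80
at_observed_pace = events_needed(100, observed_pace)

print(f"best case: {best_case} events")
print(f"observed pace: {observed_pace} points per event -> {at_observed_pace} events")
```

Even in the best case a 100-point correction takes 10 events, and at the pace seen in Guo Juan's history it would take well over a hundred.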

The scatter plot for the second regression:

One possible alternative would be to use online play. But that requires all players to play on the same platform, and it requires a lot of players to have a known account on that platform. Online platforms don't strictly separate amateurs and pros, and they allow for playing a lot of games. But without knowing who is who, the online rating can't be related to any offline rating. And of course online ratings don't fit offline ratings perfectly, so I would have to do at least two linear regressions, each of which has an error term. So far, I don't know the online account name of even one strong pro (I know of https://senseis.xmp.net/?KGSHighDanPlayers , but that's not useful: I need accounts that still exist and play ranked games regularly), so I have to postpone that idea.

PS: An idea that I had, namely using the error frequency distribution determined by KataGo to guess the strength, has luckily been implemented already by someone else and is usable here: http://howdeepisyourgo.org/ . You just have to upload an SGF file and it will guess your rating. Unfortunately it is based on OGS data and thus does not extend up to professional strength, but the idea could be extended to that range with additional training data. The tool is based on the work "Strength estimation in the game of Go" by Peter Neubauer. I read it and the idea appears to be quite solid. I would love to use this as the basis for a universal rating scale, especially because you need only 1 game to establish a rating, not dozens (as is the case for Elo, Glicko and the like).
