Cross-Alphabetic Characters

17

1

Background

I saw this very interesting Venn diagram on Wikipedia: https://simple.wikipedia.org/wiki/Alphabet#/media/File:Venn_diagram_gr_la_ru.svg

It shows which letters (physical shapes) are shared between the Russian, Greek and Latin alphabets.

The Challenge

Given an input string of characters from any of the three scripts shown (i.e. capitalised Greek, Cyrillic or Latin letters), output the percentage of the string that matches each language. Repeated letters count every time.

For example, FFLURS consists entirely of Latin-only characters, so the output is FFLURS is 100% Latin, 0% Cyrillic and 0% Greek.

By contrast, TOX BEAM PHP is formed completely of characters that appear in all three languages, so the output is TOX BEAM PHP is 100% Cyrillic, 100% Greek and 100% Latin.

Lookup Tables

The Venn diagram is the source of the letters below:

Latin Only:

J,G,S,D,Q,U,V,W,F,L,R

Latin and Cyrillic:

C, С

Latin and Greek:

I,Ι, N,Ν, Z,Ζ

Greek and Cyrillic:

Φ,Ф, Π,П, Γ,Г, Λ,Л

All three:

A,B,E,O,M,K,T,X,Y,H,P (and equivalents in Greek and Cyrillic), (space)

The rest ...are Greek-only or Cyrillic-only.
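A minimal, ungolfed sketch of the classification these lists imply, in Python. It is not part of the original post: the names `memberships` and `ratios` are made up for illustration, and a non-empty input is assumed. Look-alike letters are written as escapes so it is clear which code point is meant.

```python
# Groups taken from the lookup lists above; tuples are (Latin, Greek, Cyrillic).
LATIN_ONLY = set("JGSDQUVWFLR")
LATIN_CYRILLIC = set("C\u0421")                      # Latin C, Cyrillic Es
LATIN_GREEK = set("INZ\u0399\u039d\u0396")           # plus Greek Iota, Nu, Zeta
GREEK_CYRILLIC = set("\u03a6\u0424\u03a0\u041f\u0393\u0413\u039b\u041b")
ALL_THREE = set(
    " ABEOMKTXYHP"
    "\u0391\u0392\u0395\u039f\u039c\u039a\u03a4\u03a7\u03a5\u0397\u03a1"  # Greek look-alikes
    "\u0410\u0412\u0415\u041e\u041c\u041a\u0422\u0425\u0423\u041d\u0420"  # Cyrillic look-alikes
)


def memberships(ch):
    """Return (latin, greek, cyrillic) flags for one valid input character."""
    if ch in ALL_THREE:
        return (1, 1, 1)
    if ch in LATIN_ONLY:
        return (1, 0, 0)
    if ch in LATIN_CYRILLIC:
        return (1, 0, 1)
    if ch in LATIN_GREEK:
        return (1, 1, 0)
    if ch in GREEK_CYRILLIC:
        return (0, 1, 1)
    # "The rest": decide by Unicode block (Greek capitals are U+0391..U+03A9).
    if 0x0391 <= ord(ch) <= 0x03A9:
        return (0, 1, 0)
    return (0, 0, 1)


def ratios(s):
    """Return [latin, greek, cyrillic] fractions; assumes s is non-empty."""
    totals = [0, 0, 0]
    for ch in s:
        for i, flag in enumerate(memberships(ch)):
            totals[i] += flag
    return [t / len(s) for t in totals]
```

For instance, `ratios("CINEMATICS")` gives `[1.0, 0.7, 0.6]` in the order Latin, Greek, Cyrillic.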

IMPORTANT NOTE

Unicode defines (for example) "A" in (at least) three separate ways - one for each language. Whichever is used in the input (&#0391, &#0410 or &#0041), the program should understand it as matching all three languages. So, A (Latin), Α (Greek Alpha) and А (Cyrillic) should all give 100% Cyrillic, 100% Greek and 100% Latin as the answer.
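To make the note concrete, here is a small Python check (an illustration only, not from the original post) showing that the one shape "A" lives at three different code points. The variable name `LOOKALIKE_A` is made up for this sketch.

```python
import unicodedata

# Latin A, Greek Alpha and Cyrillic A: same shape, three code points.
LOOKALIKE_A = ["\u0041", "\u0391", "\u0410"]

names = [unicodedata.name(ch) for ch in LOOKALIKE_A]
for ch, name in zip(LOOKALIKE_A, names):
    print(f"U+{ord(ch):04X}  {name}")
```

All three must be treated as belonging to every alphabet, even though a plain string comparison sees them as different characters.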

Input Format

Any string containing exclusively А-Я, Α-Ω, A-Z and (space). These characters can be repeated multiple times within the string.

Output Format

The output can be in any format, provided the function produces consistent results. I would like to see output in the format I show in my examples (FFLURS is 100% Latin, 0% Cyrillic and 0% Greek), but in order to make the challenge more open to anyone I'm happy to accept arrays/strings of percentages/ratios:

[100,0,0],

100 0 0

[1.0 0.0 0.0]

as long as it's always clear which number is which language - so the output should be consistent.

Some more test cases

CINEMATICS -> CINEMATICS is 100% Latin, 70% Greek and 60% Cyrillic

ЩJЩFЩLΞRΞVΞW -> ЩJЩFЩLΞRΞVΞW is 50% Latin, 25% Cyrillic and 25% Greek

(space) -> (space) is 100% Cyrillic, 100% Greek and 100% Latin

ΨΩTESTINGЯЮ -> ΨΩTESTINGЯЮ is 63.6% Greek, 63.6% Latin and 45.5% Cyrillic
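The sentence format used in these examples can be sketched in Python as follows. The ordering rule (highest percentage first, ties broken alphabetically) and the rounding to one decimal place are inferred from the examples, not stated by the challenge, and `describe` is a hypothetical helper name.

```python
def describe(text, ratios):
    """Format ratios (dict: alphabet name -> fraction in [0, 1]) as a sentence."""

    def pct(x):
        # One decimal place, dropping a trailing ".0" (100.0 -> "100").
        s = f"{100 * x:.1f}"
        return s[:-2] if s.endswith(".0") else s

    # Assumed ordering: descending percentage, then alphabetical by name.
    ordered = sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))
    parts = [f"{pct(value)}% {name}" for name, value in ordered]
    return f"{text} is {parts[0]}, {parts[1]} and {parts[2]}"
```

Under those assumptions, `describe("FFLURS", {"Latin": 1.0, "Greek": 0.0, "Cyrillic": 0.0})` reproduces the sentence from the challenge text.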

Winning Criteria

Usual rules and exclusions apply; shortest answer (bytes) wins.

(sandbox link: https://codegolf.meta.stackexchange.com/a/14984/62289)

For the avoidance of doubt, the only Unicode characters valid in the input are:

  • 0020, 0041-005A (Latin alphabet)
  • 0020, 0391-03A9 (Greek alphabet)
  • 0020, 0401, 0410-042F (Cyrillic alphabet)

But as shown in the lookup tables, characters may be cross-alphabetic.
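The list of valid code points translates directly into a range check. The Python sketch below is only an illustration (the names `VALID_RANGES` and `is_valid_input` are made up); answers are not required to validate input. Note that it accepts the whole range 0391-03A9 as listed, including the unassigned U+03A2.

```python
# Inclusive (low, high) code point ranges from the list above.
VALID_RANGES = [
    (0x0020, 0x0020),  # space
    (0x0041, 0x005A),  # Latin A-Z
    (0x0391, 0x03A9),  # Greek capitals
    (0x0401, 0x0401),  # Cyrillic Io
    (0x0410, 0x042F),  # Cyrillic A-Ya
]


def is_valid_input(s):
    """True if every character of s falls inside one of the listed ranges."""
    return all(
        any(low <= ord(ch) <= high for low, high in VALID_RANGES)
        for ch in s
    )
```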

Adding Jonathan Allan's table from the comments:

                                                 Latin  Greek  Cyrillic
U+0020     Space                                 1      1      1
U+0041  A  Latin capital letter A                1      1      1
U+0042  B  Latin capital letter B                1      1      1
U+0043  C  Latin capital letter C                1      0      1
U+0044  D  Latin capital letter D                1      0      0
U+0045  E  Latin capital letter E                1      1      1
U+0046  F  Latin capital letter F                1      0      0
U+0047  G  Latin capital letter G                1      0      0
U+0048  H  Latin capital letter H                1      1      1
U+0049  I  Latin capital letter I                1      1      0
U+004A  J  Latin capital letter J                1      0      0
U+004B  K  Latin capital letter K                1      1      1
U+004C  L  Latin capital letter L                1      0      0
U+004D  M  Latin capital letter M                1      1      1
U+004E  N  Latin capital letter N                1      1      0
U+004F  O  Latin capital letter O                1      1      1
U+0050  P  Latin capital letter P                1      1      1
U+0051  Q  Latin capital letter Q                1      0      0
U+0052  R  Latin capital letter R                1      0      0
U+0053  S  Latin capital letter S                1      0      0
U+0054  T  Latin capital letter T                1      1      1
U+0055  U  Latin capital letter U                1      0      0
U+0056  V  Latin capital letter V                1      0      0
U+0057  W  Latin capital letter W                1      0      0
U+0058  X  Latin capital letter X                1      1      1
U+0059  Y  Latin capital letter Y                1      1      1
U+005A  Z  Latin capital letter Z                1      1      0

U+0391  Α  Greek capital letter Alpha            1      1      1
U+0392  Β  Greek capital letter Beta             1      1      1
U+0393  Γ  Greek capital letter Gamma            0      1      1
U+0394  Δ  Greek capital letter Delta            0      1      0
U+0395  Ε  Greek capital letter Epsilon          1      1      1
U+0396  Ζ  Greek capital letter Zeta             1      1      0
U+0397  Η  Greek capital letter Eta              1      1      1
U+0398  Θ  Greek capital letter Theta            0      1      0
U+0399  Ι  Greek capital letter Iota             1      1      0
U+039A  Κ  Greek capital letter Kappa            1      1      1
U+039B  Λ  Greek capital letter Lambda           0      1      1
U+039C  Μ  Greek capital letter Mu               1      1      1
U+039D  Ν  Greek capital letter Nu               1      1      0
U+039E  Ξ  Greek capital letter Xi               0      1      0
U+039F  Ο  Greek capital letter Omicron          1      1      1
U+03A0  Π  Greek capital letter Pi               0      1      1
U+03A1  Ρ  Greek capital letter Rho              1      1      1

U+03A3  Σ  Greek capital letter Sigma            0      1      0
U+03A4  Τ  Greek capital letter Tau              1      1      1
U+03A5  Υ  Greek capital letter Upsilon          1      1      1
U+03A6  Φ  Greek capital letter Phi              0      1      1
U+03A7  Χ  Greek capital letter Chi              1      1      1
U+03A8  Ψ  Greek capital letter Psi              0      1      0
U+03A9  Ω  Greek capital letter Omega            0      1      0

U+0401  Ё  Cyrillic capital letter Io            0      0      1

U+0410  А  Cyrillic capital letter A             1      1      1
U+0411  Б  Cyrillic capital letter Be            0      0      1
U+0412  В  Cyrillic capital letter Ve            1      1      1
U+0413  Г  Cyrillic capital letter Ghe           0      1      1
U+0414  Д  Cyrillic capital letter De            0      0      1
U+0415  Е  Cyrillic capital letter Ie            1      1      1
U+0416  Ж  Cyrillic capital letter Zhe           0      0      1
U+0417  З  Cyrillic capital letter Ze            0      0      1
U+0418  И  Cyrillic capital letter I             0      0      1
U+0419  Й  Cyrillic capital letter Short I       0      0      1
U+041A  К  Cyrillic capital letter Ka            1      1      1
U+041B  Л  Cyrillic capital letter El            0      1      1
U+041C  М  Cyrillic capital letter Em            1      1      1
U+041D  Н  Cyrillic capital letter En            1      1      1
U+041E  О  Cyrillic capital letter O             1      1      1
U+041F  П  Cyrillic capital letter Pe            0      1      1
U+0420  Р  Cyrillic capital letter Er            1      1      1
U+0421  С  Cyrillic capital letter Es            1      0      1
U+0422  Т  Cyrillic capital letter Te            1      1      1
U+0423  У  Cyrillic capital letter U             1      1      1
U+0424  Ф  Cyrillic capital letter Ef            0      1      1
U+0425  Х  Cyrillic capital letter Ha            1      1      1
U+0426  Ц  Cyrillic capital letter Tse           0      0      1
U+0427  Ч  Cyrillic capital letter Che           0      0      1
U+0428  Ш  Cyrillic capital letter Sha           0      0      1
U+0429  Щ  Cyrillic capital letter Shcha         0      0      1
U+042A  Ъ  Cyrillic capital letter hard sign     0      0      1
U+042B  Ы  Cyrillic capital letter Yeru          0      0      1
U+042C  Ь  Cyrillic capital letter soft sign     0      0      1
U+042D  Э  Cyrillic capital letter E             0      0      1
U+042E  Ю  Cyrillic capital letter Yu            0      0      1
U+042F  Я  Cyrillic capital letter Ya            0      0      1

simonalexander2005

Posted 2018-03-17T14:36:46.573

Reputation: 1 157

Welcome to PPCG! This is a nice first challenge. :) A note on your UTF-8 note: it's not UTF-8 which defines multiple versions of these characters but Unicode (and UTF-8 is just one specific way to encode Unicode code points). Also, since this part is rather important for answers to get right, you might want to include the explicit list of all Unicode characters that need to be handled correctly. – Martin Ender – 2018-03-17T14:48:51.903

@ngn yes, thanks. – simonalexander2005 – 2018-03-17T15:41:22.067

@JonathanAllan, ngn: I agree, I intended just the letters that are in the Venn diagram - but all representations of those letters are valid - like I try to say in the question, there are multiple unicode representations of "A", for example – simonalexander2005 – 2018-03-17T15:42:33.707

Cyrillic "El" is actually Greek "Lambda". Cyrillic "Pe" is Greek "Pi". – simonalexander2005 – 2018-03-17T15:48:43.143

@JonathanAllan Depending on font, Cyrillic Л can look exactly like Greek Λ. The Π-like shape is just another stylistic variation of the same letter. – ngn – 2018-03-17T16:02:21.863

@simonalexander2005 &#0391 - if you meant that as an html character entity, it should be &#x0391; – ngn – 2018-03-17T17:04:54.827

By the way, I still think A-Z should be referred to as the "English alphabet". The "Latin alphabet" is a broader concept. – ngn – 2018-03-17T17:08:43.010

Answers

4

Jelly, 56 bytes

A hash may well be shorter.

O:⁹:2;ON©œị“ŒḂI4ƥƒⱮıtɱN¦“¤COṙṚ¹`“ÑṂḄẈɼ]ġÐ’b4¤+4Bṙ®Ḣµ€S÷L

A monadic link returning a list of ratio amounts in the order English, Greek, Russian.

Try it online!
...or see a fully formatted output (including the implied rounding to one decimal place)

How?

We wish to have code which translates each possible character to a triple of ones and zeros representing whether they belong to each of the alphabets (much like the table in the question where C is 1 0 1). Once that is done we can sum across these and divide by the length to yield the ratios (between zero and one inclusive) - this is just S÷L (seen at the right of the code).

For any given character we know that if the ordinal is less than 256 it counts as English, if it is greater than 1024 it counts as Russian, and if it lies between 256 and 1024 it counts as Greek. As such, taking the ordinal, integer dividing by 256 and then integer dividing the result by two yields 0 for the space and Latin characters (count as English), 1 for Hellenic (count as Greek), and 2 for Cyrillic (count as Russian). This is just O:⁹:2 in Jelly (seen at the left of the code).
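The block computation can be checked with a couple of lines of Python (an illustrative stand-in for the Jelly snippet; `block_index` is a made-up name):

```python
def block_index(ch):
    """0 for space/Latin, 1 for Greek, 2 for Cyrillic (valid input only)."""
    return ord(ch) // 256 // 2


# Lowest and highest valid code point of each group.
samples = {
    " ": 0, "Z": 0,
    "\u0391": 1, "\u03a9": 1,
    "\u0401": 2, "\u042f": 2,
}
for ch, expected in samples.items():
    print(f"U+{ord(ch):04X} -> {block_index(ch)} (expected {expected})")
```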

If we rotate the triples of bits such that the natural alphabet bit* is the most significant then we can encode the lower two bits (as values between zero and three inclusive) in a look-up table with three rows and then rotate right by the numbers found above.

When we do this there are two things worthy of note - 1. Jelly has a rotate-left-by atom not a rotate-right-by one; 2. the Hellenic row of the look-up table would start with a zero (since Ξ is only Greek), thwarting a simple base-4 encoding (since leading zeros are not encodable). To alleviate (1) we can rotate-left by the negated value and to alleviate (2) we can encode our rows in reverse and index into them with the negative amount. This way we can negate both the row and column index with a single byte (N), so our row and column indices may be calculated with O:⁹:2;ON.
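The leading-zero problem is easy to reproduce outside Jelly. The Python sketch below uses a toy four-entry row (not the real table) and two made-up helpers, `to_base4` and `from_base4`, standing in for Jelly's base conversion:

```python
def from_base4(digits):
    """Pack a list of base-4 digits (most significant first) into an integer."""
    n = 0
    for d in digits:
        n = n * 4 + d
    return n


def to_base4(n):
    """Unpack an integer into base-4 digits, most significant first."""
    digits = []
    while n:
        n, d = divmod(n, 4)
        digits.append(d)
    return digits[::-1] or [0]


toy_row = [0, 1, 3, 2]                       # starts with a zero
lossy = to_base4(from_base4(toy_row))        # the leading zero disappears
reversed_row = toy_row[::-1]                 # now starts with a non-zero digit
safe = to_base4(from_base4(reversed_row))    # round-trips intact
print(lossy, safe)
```

Storing the row reversed only helps when its last entry is non-zero, which is the situation the answer describes for its rows.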

Note that Jelly now has a multi-dimensional indexing atom, œị.

The table is formed from three large numbers which, once converted to base four, give the lower bits required for Cyrillic, Greek and Latin(+Space) respectively. They are of minimal length such that modular indexing by the negated ordinal values is possible - 47, 25, and 30 respectively (the .s are at unused indexes):

1: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 1 3 3 2 3 1 3 3 3 1 3 0 0 0 0 3 0 1 3 0 3 0 0 0 0 0 0
   . . . . . . . . . Я Ю Э Ь Ы Ъ Щ Ш Ч Ц Х Ф У Т С Р П О Н М Л К Й И З Ж Е Д Г В Б А Ё . . . . .

2: 3 2 3 1 0 3 1 3 0 2 3 3 0 0 3 2 3 3 0 0 3 2 3 0 1
   Μ Λ Κ Ι Θ Η Ζ Ε Δ Γ Β Α Ω Ψ Χ Φ Υ Τ Σ . Ρ Π Ο Ξ Ν

3: 3 3 0 0 0 3 0 0 0 3 3 2 3 0 3 0 2 3 0 0 3 0 1 3 3 0 0 3 0 2
   Y X W V U T S R Q P O N M L K J I H G F E D C B A . .   . Z

As an example consider the character Φ at Unicode code point U+03A6 (which should yield [0,1,1]). It has an ordinal value of (3×16²+10×16+6 =) 934. O:⁹:2 computes (934//256//2 =) 1, identifying it as part of the Hellenic block. The ;O concatenates the ordinal, giving us [1,934], and the N then negates both values, giving us [-1,-934]. Since Jelly indexing is both 1-based and modular and there are three rows, the -1 references the second of the three rows (row 2 in the above code-block). Since that middle row has a length of 25, the -934 references the (-934%25 =) 16th entry in that row, which is 2. The code then adds four (the most significant bit), giving us 6, which converted to binary is [1,1,0]. The code then rotates this left by each of [-1,-934] and takes the head (i.e. the rotation left by -1, a rotation right by 1), yielding [0,1,1] as required.

* English for space since it's grouped with the Latin characters
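The whole lookup can be emulated in Python to check the walkthrough. This is a sketch under assumptions, not a translation of the Jelly byte for byte: the three rows are the base-4 lists quoted in the commented code, `jelly_index` imitates 1-based modular indexing, and all function names are made up.

```python
ROWS = [
    # Cyrillic (47 entries)
    [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,3,3,2,3,1,3,3,3,1,3,
     0,0,0,0,3,0,1,3,0,3,0,0,0,0,0,0],
    # Greek (25 entries)
    [3,2,3,1,0,3,1,3,0,2,3,3,0,0,3,2,3,3,0,0,3,2,3,0,1],
    # Latin and space (30 entries)
    [3,3,0,0,0,3,0,0,0,3,3,2,3,0,3,0,2,3,0,0,3,0,1,3,3,0,0,3,0,2],
]


def jelly_index(seq, i):
    """1-based, modular indexing: index 0 refers to the last item."""
    return seq[(i - 1) % len(seq)]


def memberships(ch):
    """Return [latin, greek, cyrillic] bits for one valid input character."""
    o = ord(ch)
    block = o // 256 // 2                      # 0 Latin, 1 Greek, 2 Cyrillic
    low = jelly_index(jelly_index(ROWS, -block), -o)
    bits = [1, low >> 1 & 1, low & 1]          # binary digits of low + 4
    # Rotating left by -block is rotating right by block.
    return bits[-block:] + bits[:-block]


def ratios(s):
    """Column sums divided by the length; assumes s is non-empty."""
    rows = [memberships(ch) for ch in s]
    return [sum(col) / len(s) for col in zip(*rows)]
```

With these definitions `memberships("\u03a6")` returns `[0, 1, 1]`, matching the worked example for Φ.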


Commented code

O:⁹:2;ON©œị“...“...“...’b4¤+4Bṙ®Ḣµ€S÷L - Link: list of characters        e.g.: "СЯ"
                                 µ€    - for €ach character:                С       Я
O                                      -   cast to ordinal               1057    1071
  ⁹                                    -   literal 256
 :                                     -   integer division                 4       4
   :2                                  -   integer divide by 2              2       2
      O                                -   cast to ordinal               1057    1071
     ;                                 -   concatenate                  [2,1057] [2,1071]
       N                               -   negate                     [-2,-1057] [-2,-1071]
        ©                              -   copy to register for later
                          ¤            -   nilad followed by link(s) as a nilad:
           “...“...“...’               -     list of integers encoded in base 250 = [4951760157204492290900832256, 1043285073970097, 1081712651052809266]
                        b4             -     convert to base 4                    = [[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,3,3,2,3,1,3,3,3,1,3,0,0,0,0,3,0,1,3,0,3,0,0,0,0,0,0],[3,2,3,1,0,3,1,3,0,2,3,3,0,0,3,2,3,3,0,0,3,2,3,0,1],[3,3,0,0,0,3,0,0,0,3,3,2,3,0,3,0,2,3,0,0,3,0,1,3,3,0,0,3,0,2]]
         œị                            -   index into                       2       0                   ^--[-2,-1071]   [-2,-1057]--^
                           +4          -   add four                         6       4
                             B         -   convert to binary             [1,1,0] [1,0,0]
                               ®       -   recall from register       [-2,-1057] [-2,-1071]
                              ṙ        -   rotate left         [[1,0,1],[0,1,1]] [[0,0,1],[1,0,0]]
                                Ḣ      -   head                          [1,0,1] [0,0,1]
                                   S   - sum                                 [1,0,2]
                                     L - length                                 2
                                    ÷  - divide                            [0.5,0,1]
                                       -   i.e.: 50.0% Latin, 0% Greek, 100% Russian

Jonathan Allan

Posted 2018-03-17T14:36:46.573

Reputation: 67 804

Looks like you have some terminology there...does "hash" mean a construct like (keys)iị(values)Ʋ? – Erik the Outgolfer – 2018-03-17T21:20:20.340

I mean forming a distinct set of keys from some mathematical manipulation so one can index into a list of values, yes. So like ...ị“... or possibly ...ṃ“... (strictly speaking the monadic chain I formed is a hash function)

– Jonathan Allan – 2018-03-17T21:32:25.400

@ngn M kebard is anning – Jonathan Allan – 2018-03-18T00:21:20.830

Oh, I know how to fix that! Reconfigure your keyboard and type Cyrillic "у" and "о" instead of "y" and "o" :D – ngn – 2018-03-18T00:34:04.550

I have never seen such a long Jelly answer... great job! – simonalexander2005 – 2018-03-18T15:32:55.847

5

JavaScript (ES6), 197 179 bytes

Returns an array of 3 ratios in [0..1].

s=>[...s].map(_=>(x='b;C6cC6%c>b^[<$]_3--_c_acC-----$+aKHbKK[`H`H]'[(p=s[a='charCodeAt'](l++)%202%116%89)>>1][a]()-36,x/=p&1||8,L+=x/4&1,G+=x/2&1,C+=x&1),l=L=G=C=0)&&[L/l,G/l,C/l]

Try it online!

How?

We use the (rather inefficient) hash function % 202 % 116 % 89 to transform each character code into an index in [0..88]. The corresponding lookup table consists of 3-bit entries where bit #2 = Latin, bit #1 = Greek and bit #0 = Cyrillic. Using decimal digits, this gives:

76273722773722017732767267300071731711117377737577371111111111000775474476474767744474447
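As a cross-check, the hash and the digit table can be tried in Python. This is an illustrative sketch rather than the answer's code; `TABLE` is the digit string above split into chunks for readability, and `memberships` is a made-up name.

```python
# The 89 decimal digits listed above (bit 2 = Latin, bit 1 = Greek, bit 0 = Cyrillic).
TABLE = (
    "7627372277" "3722017732" "7672673000" "7173171111" "7377737577"
    "3711111111" "1100077547" "4476474767" "744474447"
)


def memberships(ch):
    """Return (latin, greek, cyrillic) bits via the % 202 % 116 % 89 hash."""
    index = ord(ch) % 202 % 116 % 89
    value = int(TABLE[index])
    return (value >> 2 & 1, value >> 1 & 1, value & 1)
```

For example, Φ (U+03A6, ordinal 934) hashes to 934 % 202 % 116 % 89 = 10, and entry 10 of the table is 3, i.e. Greek and Cyrillic.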

We append an extra 1 to get an even number of entries and encode this bit stream with printable ASCII characters in the range [36..99] ($ to c), with 6 bits of payload data per character.

This leads to the following string:

b;C6cC6%c>b^[<$]_3--_c_acC-----$+aKHbKK[`H`H]

The offset was chosen to avoid characters such as \ that would have required escaping.
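The packing can be undone in a few lines of Python, which shows how two 3-bit entries share one character. This is a sketch for illustration; `unpack` is a made-up name and the offset 36 is the one subtracted in the JavaScript code.

```python
PACKED = "b;C6cC6%c>b^[<$]_3--_c_acC-----$+aKHbKK[`H`H]"


def unpack(packed, offset=36):
    """Split each character into its high and low 3-bit entries."""
    digits = []
    for c in packed:
        value = ord(c) - offset          # 6-bit payload, 0..63
        digits.append(value >> 3)        # entry at the even index
        digits.append(value & 7)         # entry at the odd index
    return "".join(str(d) for d in digits)


unpacked = unpack(PACKED)
print(len(unpacked), unpacked[:10])
```

The result is the 89-digit table followed by the padding digit 1, so entry p is found in character p >> 1, in the high half when p is even and in the low half when p is odd.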

Arnauld

Posted 2018-03-17T14:36:46.573

Reputation: 111 334

3

Ruby, 165 bytes

->s{(0..2).map{|x|s.chars.map{|c|o=c.ord;(o<33?7:"ĝ]ē¯]÷W59WUė½ñĝĕ×ßoĝėÏė55#{?!*15}"[o-[913,1040,65][y=o>>7<=>7]].ord+226>>3*-~y)[x]*1.0}.sum/s.size}}

Try it online!

Edit: Significantly golfed the code, and most importantly, squeezed 3 translation sequences into one UTF-8 string. The original longer code is kept below for better readability and explanation of the logic.

Ruby, 211 bytes

->s{(0..2).map{|x|s.chars.map{|x|o=x.ord;o<33?7:o<91?"77517117317173771117111773"[o-65]:o<938?"7762737237673276702776722"[o-913]:"74764744444767776757767#{?4*15}"[o-1040]}.inject(0.0){|y,z|y+=z.to_i[x]}/s.size}}

Try it online!

May not be the most efficient approach, but does the job. Uses a translation table for each alphabet with character occurrence in different scripts encoded by the bits of the number (in the order: Latin, Greek, Russian). The output is an array of percentages in the same order.

To fix the outlier Ё case I extended the Russian-only block of 4-s from 10 positions at the end of the alphabet to 15. This way, Ё gets picked correctly with a negative index (and we are not required to handle the lowercase letters that correspond to these extra indices).
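The negative-index trick works the same way in Python, so it can be sketched there. Note the assumptions: the digit prefix below is written out from the table in the question (bit 0 = Latin, bit 1 = Greek, bit 2 = Cyrillic) rather than copied from the answer, and `cyrillic_bits` is a made-up name.

```python
# А..Х from the question's table, then 15 Cyrillic-only entries instead of 10.
CYRILLIC = "7476474444767776757767" + "4" * 15


def cyrillic_bits(ch):
    """Look up a Cyrillic capital; Ё (U+0401) yields index -15."""
    return int(CYRILLIC[ord(ch) - 1040])


# Ё is 15 code points before А, so it wraps to the 15th entry from the end,
# which lies inside the padded run of 4s.
print(ord("\u0401") - 1040, cyrillic_bits("\u0401"))
```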

Kirill L.

Posted 2018-03-17T14:36:46.573

Reputation: 6 693

1

Retina 0.8.2, 230 bytes

.+
$&¶$&¶$&¶$&
T`ΓΔΘΛΞΠΣΦΨΩЁБГДЖ-ЙЛПФЦ-Я`_`.+(?=¶.+¶.+$)
T`CDFGJ\LQRSUVWЁБДЖ-ЙС-Я`_`.+(?=¶.+$)
T`DFGIJ\LNQRSUVWZΔΖΘΙΝΞΣΨΩ`_`.+$
¶(.*)
¶$.1$*
1
100$*
.
1
(1+)¶(\1)*1*¶(\1)*1*¶(\1)*1*
$#2 $#3 $#4

Try it online! Link includes test cases. Note: Output is in % truncated down to an integer, because increasing the precision made the script too slow and it timed out on TIO.
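The truncation mentioned in the note amounts to an integer division of 100 times each count by the string length. A tiny Python sketch of that arithmetic (an illustration, not a translation of the Retina stages; `truncated_percentages` is a made-up name):

```python
def truncated_percentages(counts, length):
    """Whole-number percentages, rounded down, for per-alphabet counts."""
    return [100 * count // length for count in counts]


# 7, 7 and 5 matching characters out of 11.
print(truncated_percentages([7, 7, 5], 11))
```

For those counts the exact values 63.6, 63.6 and 45.5 come out as 63, 63 and 45.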

Neil

Posted 2018-03-17T14:36:46.573

Reputation: 95 035