Hungarian alphabetical order

19

3

For those who wish a lot more challenge then the old Spanish alphabetical order, let's take a look at how the Hungarian alphabet is ordered.

a, á, b, c, cs, d, dz, dzs, e, é, f, g, gy, h, i, í, j, k, l, ly, m, n, ny, o, ó, ö, ő, p, q, r, s, sz, t, ty, u, ú, ü, ű, v, w, x, y, z, zs

actually, q, w, x and y are not used in Hungarian words, but they are included for loanwords and foreign names. Foreign accented characters which are not part of the Hungarian alphabet (like ñ), have the same priority as the non-accented ones, but we disregard them for this challenge.

The rules, summarized:

  • Digraphs (cs, sz, etc.) and the trigraph (dzs) are considered as they were letters on their own.
cudar
cukor
cuppant
csalit
csata
  • If the same digraph or trigraph occurs twice directly after each other in a word, they are written in a simplified way: ssz instead of szsz, ddzs instead of dzsdzs but for the alphabetical order the non-simplified order is used. For example kasza < kaszinó < kassza, because kassza is used as k+a+sz+sz+a for the sake of ordering. Sometimes you can find the non-contracted version in a word, in case of compound words.
kasza
kaszinó
kassza
kaszt
nagy
naggyá
nagygyakorlat
naggyal
nagyít
  • capitalization doesn't matter, with the exception when the two words would be exactly the same without capitalization, in which case the lower case letter has priority
jácint
Jácint
Zoltán
zongora
  • The short and long versions of accented vowels have the same priority (a - á, e -é, i - í, o - ó, ö - ő, u - ú ü - ű), with a single exception: if the two words would otherwise be exactly the same, the short vowel has priority over the long vowel. Note, that the vowels with umlaut (ö and ü) are completely different characters from o and u.
Eger
egér
író
iroda
irónia
kerek
kerék
kérek
szúr
szül
  • Hyphens or spaces (for example, in compound words, names, etc.) are completely ignored
márvány
márványkő
márvány sírkő
Márvány-tenger
márványtömb

The task

Your program/function receives strings, composed of characters from the Hungarian alphabet (both lower- and upper-case), but a string might contain spaces or hyphens. For simplicity's sake, the minus sign (ASCII 45) can be used as a hyphen. Note that some characters (like the ő) are not part of ASCII. You can use any encoding you wish, if it supports all the required characters.

You have to order the lines correctly and display/return the result.

You can use any randomly ordered subset of the above examples for testing.

EDIT:

Please don't use any built-in or other way which already knows the Hungarian alphabetical ordering by itself. It would make the competition pointless, and take all the challenge from finding the best regular expression or the best code golfing tricks.

EDIT2:

To clear a clarification asked by isaacg: "two strings that only differ by capitalization and long vs. short vowels, but differs in both ways" : Although no rule in the official document explicitly addresses this question, an example found within points to the length of the vowel having more importance than the capitalization.

vsz

Posted 2016-03-11T17:53:18.497

Reputation: 7 963

@FryAmTheEggman Where do you see that? – Morgan Thrapp – 2016-03-11T17:57:46.030

@FryAmTheEggman : I don't quite understand it, what functions did i disallow? – vsz – 2016-03-11T17:57:46.273

I didn't post challenges for a while, is it now a custom to allow for the input to already have been read and stored in some container, so that your code doesn't have to do the reading? It might make sense, specialized golfing languages would have an unfair advantage otherwise, as they can read input with little to no code. – vsz – 2016-03-11T18:02:01.580

@FryAmTheEggman : ok, if no one argues otherwise, let's accept such solutions as well. I was only wondering because in previous questions it happened that i wrote programs and there were plenty of HTML submissions or other cases where the definition of "program" was quite ambiguous. – vsz – 2016-03-11T18:06:47.653

@FryAmTheEggman> I recommended the output format only because the words might include spaces. I would argue that having them in separate lines was not a "cumbersome" format. – vsz – 2016-03-11T18:08:30.050

For reference, we've got a meta post with default I/O formats (although I think that one might be linked from the post Fry already posted).

– Martin Ender – 2016-03-11T18:09:18.480

@MartinBüttner : Seems OK to me. However, how would you guarantee that the output in unambiguous regarding spaces. As there is at least one example, a word can, in fact, contain whitespace, and it can be regarded as a single "word" for the sake of ordering. Or is the word "display" which you don't agree with? What would you suggest instead? – vsz – 2016-03-11T18:12:49.077

9Man, I can't even memorize our proper alphabetical order. How am I going to program this? ;) – Andras Deak – 2016-03-11T18:13:23.700

Clarification: what is the proper ordering for two strings that only differ by capitalization and long vs. short vowels, but differs in both ways? – isaacg – 2016-03-11T18:23:03.097

@isaacg : good point, and interestingly, it's nowhere specified in the official document of the Hungarian Academy. Maybe I'll have to write them to ask for a clarification? However, an example in that very document has Eger preceding egér, so it seems the vowel length has priority. – vsz – 2016-03-11T18:32:46.077

1I've been trying to come up with a bound-to-fail counterexample, where an apparent digraph is actually two letters, such as malacsült or nyílászáró. I wonder if there are any (but you'd need a vocabulary to check for that, which is presumably not part of this challenge) – Andras Deak – 2016-03-11T21:19:07.593

1There is no example containing dzs – TheConstructor – 2016-03-11T21:58:58.293

@vsz vowel length doesn't affect order unless it's the only thing that remains. For example, "Eger" comes before "egér", but "mámor" comes before "maradi" – Bálint – 2017-06-05T15:26:39.700

1Grammar. Level: Hungarian! – sergiol – 2017-07-28T00:51:30.697

I haven't realized just how complex our ordering is up until this point – SeinopSys – 2017-10-29T10:59:29.453

Answers

4

Perl, 250

Includes +11 for -Mutf8 -CS.

use Unicode::Normalize;$r="(?=cs|zs|dz|sz|[glnt]y)";print map/\PC*
/g,sort map{$d=$_;s/d\Kd(zs)|(.)\K$r\2(.)/\L$+\E$&/gi;s/d\Kzs/~$&/gi;s/$r.\K./~$&/gi;s/(\p{Ll}*)(\w?)\s*-*/\U$1\L$2/g;$c=$_;$b=$_=NFD lc;y/̈̋/~~/d;join$;,$_,$b,$c,$d}<>

Uses the decorate-sort-undecorate idiom (AKA Schwartzian Transform), and multilevel sorting, where the levels are:

  • L1: compare base letters, ignore diacritics, case, and some punctuation.
  • L2: compare base letters and diacritics, ignore case and some punctuation.
  • L3: compare base letters, diacritics and case, ignore some punctuation.
  • Ln: tie-breaking byte-level comparison.

Internally, (ASCII 0x1C Field Separator — whose value is less than any character in the alphabet for this challenge) is used as a level separator.

This implementation has many limitations, amongst them:

  • No support for foreign characters.
  • Cannot disambiguate between contracted geminated (long) digraphs/trigraphs, and consonant+digraph/trigraph, e.g: könnyű should collate as <k><ö><ny><ny><ű>, while tizennyolc should collate as <t><i><z><e><n><ny><o><l><c>; házszám 'address = house (ház) number (szám)' should collate as <h><á><z><sz><á><m> and not as *<h><á><zs><z><á><m>.
  • Collation for contracted long digraphs is not that consistent (but it is stable): we disambiguate at the identical level (ssz <n szsz, ..., zszs <n zzs ); glibc collates the short forms before the full forms (ssz < szsz, ..., zzs < zszs ), ICU collates the long forms before the short forms starting at L3 Case and Variants (szsz <3 ssz, ..., zszs <3 zzs )

Expanded version:

use Unicode::Normalize;

$r="(?=cs|zs|dz|sz|[glnt]y)";   # look-ahead for digraphs

print map/\PC*\n/g,             # undecorate
  sort                          # sort
  map{                          # decorate

          $d=$_;                # Ln: identical level

          # expand contracted digraphs and trigraphs
          s/d\Kd(zs)|(.)\K$r\2(.)/\L$+\E$&/gi;

          # transform digraphs and trigraphs so they 
          #  sort correctly
          s/d\Kzs/~$&/gi;s/$r.\K./~$&/gi;

          # swap case, so lower sorts before upper
          # also, get rid of space, hyphen, and newline
          s/(\p{Ll}*)(\w?)\s*-*/\U$1\L$2/g;

          $c=$_;                # L3: Case

          $b=$_=NFD lc;         # L2: Diacritics

          # transform öő|üű so they sort correctly
          # ignore diacritics (acute) at this level
          y/\x{308}\x{30b}\x{301}/~~/d;

                                # L1: Base characters
          join$;,$_,$b,$c,$d
  }<>

†. Some well-known multi-level collation algorithms are the Unicode Collation Algorithm (UCA, Unicode UTS#10), ISO 14651 (available at the ISO ITTF site) the LC_COLLATE parts at ISO TR 30112 (draft available at the ISO/IEC JTC1/SC35/WG5 home) which obsoletes ISO/IEC TR 14652 (available at the ISO/IEC JTC1/SC22/WG20 home) and LC_COLLATE at POSIX.

‡. Doing this correctly would require a dictionary. ICU treats weirdly capitalized groups as non-contractions/non-digraphs/non-trigraphs, e.g: ccS <3 CcS <3 cCs <3 cCS <3 CCs <3 cS <3 cs <3 Cs <3 CS <3 ccs <3 Ccs <3 CCS

ninjalj

Posted 2016-03-11T17:53:18.497

Reputation: 3 018

You should be able to save some bytes using my expansion RegExp. – TheConstructor – 2016-03-19T14:48:01.413

6

Java 8, 742 bytes

Could reduce by another 3 bytes naming the function s instead of sort or another 16 bytes if not counting class-definition.

public class H{String d="cs|dzs?|gy|ly|sz|ty|zs";void sort(java.util.List<String>l){l.sort((a,b)->{String o="-a-á-b-cs-dzs-e-é-f-gy-h-i-í-j-k-ly-m-ny-o-ó-ö-ő-p-q-r-sz-ty-u-ú-ü-ű-v-w-x-y-zs-";int i=c(r(a),r(b),r(o));return i!=0?i:(i=c(a,b,o))!=0?i:b.charAt(0)-a.charAt(0);});}String r(String a){for(int i=0;i<8;i++)a=a.toLowerCase().replace("ááéíóőúű".charAt(i),"aaeioöuü".charAt(i));return a;}int c(String a,String b,String o){a=n(a);b=n(b);while(!"".equals(a+b)){int i=p(a,o),j=p(b,o);if(i!=j)return i-j;a=a.substring(i%4);b=b.substring(j%4);}return 0;}int p(String a,String o){a=(a+1).replaceAll("("+d+"|.).*","-$1");return o.indexOf(a)*4+a.length()-1;}String n(String a){return a.toLowerCase().replaceAll("(.)(?=\\1)("+d+")| |-","$2$2");}}

Can be used like this:

new H().sort(list);

Test-suite:

public static void main(String[] args) {
    test(Arrays.asList("cudar", "cukor", "cuppant", "csalit", "csata"));
    test(Arrays.asList("kasza", "kaszinó", "kassza", "kaszt", "nagy", "naggyá", "nagygyakorlat", "naggyal",
            "nagyít"));
    test(Arrays.asList("jácint", "Jácint", "Zoltán", "zongora"));
    test(Arrays.asList("Eger", "egér", "író", "iroda", "irónia", "kerek", "kerék", "kérek", "szúr", "szül"));
    test(Arrays.asList("márvány", "márványkő", "márvány sírkő", "Márvány-tenger", "márványtömb"));
}

private static void test(final List<String> input) {
    final ArrayList<String> random = randomize(input);
    System.out.print(input + " -> " + random);
    new H().sort(random);
    System.out.println(" -> " + random + " -> " + input.equals(random));
}

private static ArrayList<String> randomize(final List<String> input) {
    final ArrayList<String> temp = new ArrayList<>(input);
    final ArrayList<String> randomOrder = new ArrayList<>(input.size());
    final Random r = new Random();
    for (int i = 0; i < input.size(); i++) {
        randomOrder.add(temp.remove(r.nextInt(temp.size())));
    }
    return randomOrder;
}

yielding

[cudar, cukor, cuppant, csalit, csata] -> [csata, cudar, cuppant, csalit, cukor] -> [cudar, cukor, cuppant, csalit, csata] -> true
[kasza, kaszinó, kassza, kaszt, nagy, naggyá, nagygyakorlat, naggyal, nagyít] -> [naggyá, kassza, kaszinó, nagygyakorlat, nagyít, nagy, kaszt, kasza, naggyal] -> [kasza, kaszinó, kassza, kaszt, nagy, naggyá, nagygyakorlat, naggyal, nagyít] -> true
[jácint, Jácint, Zoltán, zongora] -> [Zoltán, jácint, zongora, Jácint] -> [jácint, Jácint, Zoltán, zongora] -> true
[Eger, egér, író, iroda, irónia, kerek, kerék, kérek, szúr, szül] -> [egér, Eger, kerék, iroda, író, kerek, kérek, szúr, irónia, szül] -> [Eger, egér, író, iroda, irónia, kerek, kerék, kérek, szúr, szül] -> true
[márvány, márványkő, márvány sírkő, Márvány-tenger, márványtömb] -> [márványtömb, márványkő, Márvány-tenger, márvány sírkő, márvány] -> [márvány, márványkő, márvány sírkő, Márvány-tenger, márványtömb] -> true

Ungolfed:

public class HungarianOrder {

    String d = "cs|dzs?|gy|ly|sz|ty|zs";

    void sort(java.util.List<String> l) {
        l.sort((a, b) -> {
            String o = "-a-á-b-cs-dzs-e-é-f-gy-h-i-í-j-k-ly-m-ny-o-ó-ö-ő-p-q-r-sz-ty-u-ú-ü-ű-v-w-x-y-zs-";
            int i = c(r(a), r(b), r(o));
            return i != 0 ? i
                    : (i = c(a, b, o)) != 0 ? i
                            : b.charAt(0) - a.charAt(0);
        });
    }

    // toLower + remove long accent
    String r(String a) {
        for (int i = 0; i < 8; i++)
            a = a.toLowerCase().replace("ááéíóőúű".charAt(i), "aaeioöuü".charAt(i));
        return a;
    }

    // iterate over a and b comparing positions of chars in o
    int c(String a, String b, String o) {
        a = n(a);
        b = n(b);
        while (!"".equals(a + b)) {
            int i = p(a, o), j = p(b, o);
            if (i != j)
                return i - j;
            a = a.substring(i % 4);
            b = b.substring(j % 4);
        }
        return 0;
    }

    // find index in o, then looking if following characters match
    // return is index * 4 + length of match; if String is empty or first character is unknown -1 is returned
    int p(String a, String o) {
        a = (a+1).replaceAll("("+d+"|.).*", "-$1");
        return o.indexOf(a) * 4 + a.length() - 1;
    }

    // expand ddz -> dzdz and such
    String n(String a) {
        return a.toLowerCase().replaceAll("(.)(?=\\1)("+ d +")| |-", "$2$2");
    }
}

I am using Java's List-type and the order()-function of it, but the comparator is all mine.

TheConstructor

Posted 2016-03-11T17:53:18.497

Reputation: 563

Impressive! I imagine you should be able to drop the list type specifier <String> and save a few characters at the cost of a few warnings? – Josh – 2016-03-11T21:22:30.257

@Josh nah, it would produce two casts as Java would infer Object as type of a and b then. I could probably get away defining a class-generic-parameter extending String, though. Also I am not expecting to have the shortest code. ;-) – TheConstructor – 2016-03-11T21:25:51.923

3

Python 3, 70

Saved 8 bytes thanks to shooqie.

I love Python. :D

Expects a list of strings.

from locale import*;setlocale(0,'hu')
f=lambda x:sorted(x,key=strxfrm)

Morgan Thrapp

Posted 2016-03-11T17:53:18.497

Reputation: 3 574

3Isn't this a standard loophole? – vsz – 2016-03-11T18:13:35.237

1@vsz Not as far as I know. Using built-ins is part of lots of challenges. – Morgan Thrapp – 2016-03-11T18:14:01.733

1@vsz It has too low an up- to down-vote ratio on the standard loopholes post to be counted as a standard, you'd have to explicitly ban it. – FryAmTheEggman – 2016-03-11T18:15:13.607

1Ok, done it. I considered explicitly banning it but i though it should be obvious that it would make the whole challenge a moot point. I'm sorry for the inconvenience. – vsz – 2016-03-11T18:16:52.483

I'll still upvote this answer as soon as a different, conforming answer appears from anyone, to give credit where credit is due. – vsz – 2016-03-11T18:24:34.850

1from locale import* saves a lot of bytes – shooqie – 2016-03-12T17:47:02.437

1This isn't a standard loophole. It's a built-in. Built-ins are allowed by default. This shouldn't have a single downvote in my opinion. – mbomb007 – 2016-03-23T19:41:30.390