java-charset-detector
Java SE 8+ Charset Detector, MIT (c) 2022 miktim@mail.ru
Encoding detection is based on the contents of the file, the frequency of the letters of the national alphabet and the built-in Java character encoders. Hint: explicitly exclude binary files by extension (.bin, .exe, .jar, .zip, etc.).
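The hint about binary files can be applied before any detection is attempted. The following is a minimal sketch, not part of this library: the class name `BinaryFilter`, the method `looksBinary` and the extension list are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of java-charset-detector.
final class BinaryFilter {
    // Illustrative list taken from the hint above; extend it as needed.
    static final Set<String> BINARY_EXTENSIONS =
            new HashSet<>(Arrays.asList("bin", "exe", "jar", "zip"));

    // Returns true when the file name ends with a known binary extension.
    static boolean looksBinary(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BINARY_EXTENSIONS.contains(ext);
    }
}
```

Files for which `looksBinary` returns true can simply be skipped instead of being passed to the detector.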
Russian is the default language. Default Java encoders:
x-UTF-16LE-BOM,UTF-16BE,x-UTF-32LE-BOM,x-UTF-32BE-BOM,UTF-8,windows-1251,cp866,koi8-r.
See https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
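To illustrate the idea of frequency-based detection with the single-byte Cyrillic encoders listed above, here is a rough sketch. It is not the library's actual algorithm: the class `FrequencySketch`, its scoring weights and the ten-letter sample of frequent Russian letters are assumptions. The bytes are decoded with each candidate encoder, expected letters earn points according to their frequency rank, and unwanted control characters cost points.

```java
import java.nio.charset.Charset;

// Illustrative only; the real detector may score text differently.
final class FrequencySketch {
    // Ten frequent Russian letters (o, e, a, i, n, t, s, r, v, l in
    // transliteration), most frequent first. Kept as Unicode escapes so the
    // source stays ASCII. An assumed sample, not the library default.
    static final String EXPECTED =
            "\u043e\u0435\u0430\u0438\u043d\u0442\u0441\u0440\u0432\u043b";

    // Control characters 0-31 except tab, CR, LF, FF, BS, plus DEL.
    static boolean isUnwanted(char c) {
        return (c < 32 && "\t\r\n\f\b".indexOf(c) < 0) || c == 0x7F;
    }

    // Higher-ranked expected letters add more; unwanted characters subtract.
    static int score(String text, String expected) {
        int score = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            int rank = expected.indexOf(c);
            if (rank >= 0) {
                score += expected.length() - rank;
            } else if (isUnwanted(c)) {
                score -= expected.length();
            }
        }
        return score;
    }

    // Returns the best-scoring encoder name, or null if nothing scores above zero.
    static String guess(byte[] bytes, String[] encoders, String expected) {
        String best = null;
        int bestScore = 0;
        for (String name : encoders) {
            String text = new String(bytes, Charset.forName(name));
            int s = score(text, expected);
            if (s > bestScore) { // strict comparison: ties keep the earlier encoder
                bestScore = s;
                best = name;
            }
        }
        return best;
    }
}
```

In this sketch the order of the encoder names only matters for ties. Multi-byte encodings such as UTF-8 would need additional validity checks, which are left out here.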
package org.miktim.lang
Overview:
Class CharsetDetector
Constructors:
CharsetDetector();
CharsetDetector(String encoders, String expected, String unwanted)
throws UnsupportedEncodingException;
// encoders: comma-separated national and multilingual Java encoder names.
// The order of encoders is significant;
// expected: a string of expected national letters in descending order of frequency;
// unwanted: a string of unwanted letters or null. Unwanted defaults:
// chars 0-31, except \t\r\n\f\b, and 0x7F(DEL).
Methods:
// detect methods return the Java charset encoder name or null
String detect(byte[] buffer, int bytesIn); // initial bytes of the file
String detect(InputStream in) throws IOException;
String detect(File file) throws NoSuchFileException, IOException;
String[] listEncoders();
String getExpected();
String getUnwanted();
CharsetStatistics getStatistics();
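The `detect(byte[] buffer, int bytesIn)` method works on the initial bytes of a file. One possible way to obtain them is sketched below; the helper class `HeadReader` and its `readHead` method are hypothetical and not part of this library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of java-charset-detector.
final class HeadReader {
    // Reads at most maxBytes from the start of the file and returns
    // exactly the bytes that were read (fewer if the file is shorter).
    static byte[] readHead(Path file, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < maxBytes
                    && (n = in.read(buffer, total, maxBytes - total)) > 0) {
                total += n;
            }
        }
        return Arrays.copyOf(buffer, total);
    }
}
```

Assuming the signatures listed above, the result could then be passed on as `new CharsetDetector().detect(head, head.length)`.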
Class CharsetStatistics: see the sources.
Usage: See ./test/detector/DetectorTest.java
Description
A 100% Java SE 8+ file encoding detector. Detection is based on the contents of the file, the frequency of the letters of the national alphabet and the built-in Java character encoders.