miktim/java-charset-detector: 100% Java SE 8+ file encoding detector. Is based on the contents of the file, the frequency of the letters of the national alphabet and the built-in Java character encoders.

Encoding detection is based on the contents of the file, the frequency of the letters of the national alphabet and the built-in Java character encoders. Hint: exclude by extension explicitly binary files (.bin, .exe, .jar, , zip, etc.).

Russian is the default language. Java encoders:
x-UTF-16LE-BOM,UTF-16BE,x-UTF-32LE-BOM,x-UTF-32BE-BOM,UTF-8,windows-1251,cp866,koi8-r.

See https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

package org.miktim.lang

Overview:

Class CharsetDetector

Constructors:
CharsetDetector();
CharsetDetector(String encoders, String expected, String unwanted)
throws UnsupportedEncodingException;

// encoders: comma-separated national and multilingual Java encoder names.
// The order of encoders is significant;
// expected: a string of expected national letters in descending order of frequency;
// unwanted: a string of unwanted letters or null. Unwanted defaults:
// chars 0-31, except \t\r\n\f\b, and 0x7F(DEL).

Methods:
// detect methods return Java charset encoder name or null
String detect(byte[] buffer, int bytesIn); // initial bytes of the file
String detect(InputStream in) throws IOException;
String detect(File file) throws NoSuchFileException, IOException;

String[] listEncoders();
String getExpected(); 
String getUnwanted();
CharsetStatistics getStatistics();

Class CharsetStatistics see in sources

Usage: See ./test/detector/DetectorTest.java

java-charset-detector

Описание

Использование cookies

java-charset-detector

Mmiktimupdate README.txt 2 месяца назад32f241

Описание

Использование cookies