View Full Version : Files header strings



Sam_Zen
08.11.2007, 04:49 AM
Almost every graphics file has a header at the start of the file that identifies the format of the bitmap or video before the actual data begins.
When a file has been given the wrong extension and is therefore associated with the wrong program, it's important to check the header of the file to identify which format one is really dealing with, so the extension can be corrected.
This can be checked by opening the file in a plain ASCII editor, such as Notepad or an equivalent.
Most bytes will be shown as blocks or odd characters, but some recognizable strings will appear in the first lines.

Bitmaps:
BMP - First chars : "BM"
JPG - On first line : "JFIF"
JPG - From camera with EXIF data : On first line "Exif", two blocks, then "II"
PNG - On first line : "PNG"
GIF - First chars : "GIF89a" (very old GIFs : "GIF87a")
TIF - (no compression) First chars : "II" or "MM"
JP2 - On first line : "jP"
PSP - First chars : "Paint Shop Pro Image File"

Video:
MPG - After three codes : "º!" (hex: BA 21)
AVI - On first line : "RIFF" and "AVI LIST"
WMV - First chars : "0&²uŽfÏ" (hex: 30 26 B2 75 8E 66 CF)
FLV - First chars : "FLV"
RM - First chars : ".RMF"
SWF - First chars : "FWS"
MP4 - On first line : "mp41"
MOV - Variable so far. I noticed the presence of the string "moov"

Some audio formats as well:
WAV - On first line : "RIFF" and "WAVEfmt"
WAV - (Compressed ADPCM) "RIFF" and "WAVEfmt 2"
AU - First chars : ".snd"
IFF - (Amiga) Some strings : "FORM - SVXVHDR - CHAN - BODY"
AIF - (Apple) Some strings : "FORM - AIFFCOMM - SSND"
WMA - First chars : "0&²uŽfϦ٠ª bÎl" (hex: 30 26 B2 75 8E 66 CF 11 A6 D9 00 AA 00 62 CE 6C)
OGG - First chars : "OggS"
FLAC - First chars : "fLaC"
RA - First chars : ".RMF"

Some other formats:
ZIP - First chars : "PK"
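
The same quick look can be automated. Below is a minimal Python sketch, not a complete identifier: it reads the first 32 bytes and compares them with some of the strings listed above. The names SIGNATURES, HEAD_SIZE, guess_format and guess_file are made up for this illustration, and the 32-byte window is an assumption standing in for "on first line".

```python
HEAD_SIZE = 32  # assumed stand-in for "the first line" of the file

# (label, magic bytes, must_be_at_start). Longer, more specific strings come
# first so that two-character markers such as "BM" or "PK" are tried last.
SIGNATURES = [
    ("ASF (WMV/WMA)", bytes.fromhex("3026B2758E66CF11A6D900AA0062CE6C"), True),
    ("PSP", b"Paint Shop Pro Image File", True),
    ("GIF", b"GIF89a", True),
    ("GIF", b"GIF87a", True),
    ("FLAC", b"fLaC", True),
    ("OGG", b"OggS", True),
    ("RealMedia", b".RMF", True),
    ("AU", b".snd", True),
    ("FLV", b"FLV", True),
    ("SWF", b"FWS", True),
    ("AVI", b"AVI LIST", False),
    ("WAV", b"WAVEfmt", False),
    ("JPG (JFIF)", b"JFIF", False),
    ("JPG (Exif)", b"Exif", False),
    ("PNG", b"PNG", False),
    ("MP4", b"mp41", False),
    ("BMP", b"BM", True),
    ("ZIP", b"PK", True),
    ("TIF", b"II", True),
    ("TIF", b"MM", True),
]


def guess_format(head):
    """Return the label of the first matching signature, or None."""
    head = head[:HEAD_SIZE]
    for label, magic, at_start in SIGNATURES:
        if at_start:
            if head.startswith(magic):
                return label
        elif magic in head:
            return label
    return None


def guess_file(path):
    """Read only the start of the file and guess its format."""
    with open(path, "rb") as f:
        return guess_format(f.read(HEAD_SIZE))
```

Two-character markers like "BM", "PK", "II" and "MM" will also match any text file that happens to start with those letters, so a hit on one of them is only a hint.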

Sam_Zen
12.11.2007, 01:58 AM
I've received some comments about this thread, saying it is not based on facts. As Matera wrote, this is the report of an observation.

And I want to state that this survey was done with a plain ASCII viewer, not with a fancy word processor.
Maybe some strings with 'odd' characters will be displayed differently in this post for different users, like those of WMV or FLA.
But even then, the altered string should still be recognizable by comparison.

I could have used a hex viewer as well, to be more precise, but this would have forced me to describe the exact position in the file of every specific hex string.

I must admit that the information on the TIF format is incomplete, because I almost never use it, so I don't have many files to test.

MatthewW
12.11.2007, 04:17 PM
TIFF files can also start with MM. To be exact, the first two characters give the byte order (MM = Motorola format = most significant byte first, II = Intel format = least significant byte first). The next two characters are the 16-bit number 42 in that byte order, so the first four bytes in hexadecimal are 49 49 2A 00 (II*null) or 4D 4D 00 2A (MMnull*).

(This is one of the dumbest decisions ever. The idea was to allow different machines to handle TIFFs in their native format, but as all TIFF-reading software has to understand both formats in order for TIFFs to be portable it just means extra work for everyone with no real benefit.)
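
As a small illustration of the four-byte check described above, here is a hedged Python sketch. The function name tiff_byte_order is invented for this example; it only looks at the byte-order mark and the number 42 and does not parse anything else in the file.

```python
import struct


def tiff_byte_order(head):
    """Return 'little' (II), 'big' (MM), or None if this is not a TIFF header."""
    if len(head) < 4:
        return None
    if head[:2] == b"II":
        order, fmt = "little", "<H"   # Intel: least significant byte first
    elif head[:2] == b"MM":
        order, fmt = "big", ">H"      # Motorola: most significant byte first
    else:
        return None
    # The next two bytes must be the 16-bit number 42 in that byte order.
    (magic,) = struct.unpack(fmt, head[2:4])
    return order if magic == 42 else None
```

Checking the 42 as well as the two letters avoids mistaking an ordinary text file that starts with "II" or "MM" for a TIFF.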

midora
12.11.2007, 08:00 PM
Quoting MatthewW: "(This is one of the dumbest decisions ever. The idea was to allow different machines to handle TIFFs in their native format, but as all TIFF-reading software has to understand both formats in order for TIFFs to be portable it just means extra work for everyone with no real benefit.)"

Why? This was quite a fair decision. Each machine can read files in its own byte order quickly and only pays the overhead for the other format.

MatthewW
12.11.2007, 09:37 PM
Quoting midora: "Why? This was a quite fair decision. Both machines can read files in their own byte order variant fast and getting the overhead for the other format."
Because they can't read the values quickly anyway. It actually takes longer to decide which way the bytes should be and then deal with them as a word than it would take to handle them as individual bytes with a fixed ordering, especially on modern processors where a branch can be costly. Formats such as JFIF and PNG specify the byte ordering and don't suffer for it, and TIFF's provision for two byte orderings merely means that code to handle TIFFs is unnecessarily bulky and slow.

Sam_Zen
13.11.2007, 12:30 AM
Nice info, MatthewW.

A good idea to mention the hex code as well.
But, only if necessary. I don't see a reason to repeat 'JFIF' with '4A 46 49 46'.
This was meant for a quick look in the first place. Since everyone at least has Notepad, it's the shortest way.
I use the Lister of Total Commander (TC) for this, so I can switch views between ASCII and hex.
So I will add the hex code to the items above with the 'odd' characters, to give precise information.

Btw, this difference between Motorola and Intel already caused trouble in the DOS days on an XT.

midora
14.11.2007, 08:47 AM
Quoting MatthewW: "Because they can't read the values quickly anyway. It actually takes longer to decide which way the bytes should be and then deal with them as a word than it would take to handle them as individual bytes with a fixed ordering."

But you decide which kind of decoder to use just once, when you open the file, and not for each single word. So each processor runs at optimal speed for its native byte order, which makes this optimal.
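
The two positions in this exchange can be put side by side in a short Python sketch. It is only meant to show the shape of each approach and says nothing about which is faster on a given machine; read_u16_fixed_le and make_u16_reader are names invented for this illustration.

```python
import struct


def read_u16_fixed_le(data, pos):
    """Fixed ordering: build the word from individual bytes, no decision needed."""
    return data[pos] | (data[pos + 1] << 8)


def make_u16_reader(order_mark):
    """Two orderings: decide once from the file's marker, then reuse the reader."""
    fmt = {b"II": "<H", b"MM": ">H"}[order_mark]
    unpack_from = struct.Struct(fmt).unpack_from  # precompiled for the chosen order

    def read_u16(data, pos):
        return unpack_from(data, pos)[0]

    return read_u16
```

With the second approach the choice is made when make_u16_reader is called, for example right after reading the first two bytes of a TIFF, and every later call to the returned reader uses that choice without testing it again.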

j7n
23.11.2007, 02:28 AM
Hi Sam_Zen.

You had a great idea by starting this table of file headers. Very useful. In the past I had used a program called WhatFormat. The problem with it, besides VisualBasic, was that it only worked with formats deemed important by its author. A hex editor, or Notepad for smaller files, is the only reliable and universal tool. It's like an oscilloscope, where you can throw in any signal and in time learn to recognize various patterns.

I've been guessing file formats using a hex editor for some time. I hope you don't mind if I make a few remarks. The special (Unicode) characters you have mentioned in the first post have little meaning since their appearance depends on the current codepage.
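
The codepage dependence is easy to reproduce. The Python sketch below takes the WMV bytes from the first post and renders them under two codepages; the helper names render and hexdump are made up for this example, and cp1252 and cp437 are chosen only as typical Windows and DOS/OEM codepages.

```python
RAW = bytes.fromhex("30 26 B2 75 8E 66 CF")  # WMV bytes from the first post


def render(raw, codepage):
    """Show the bytes the way a text viewer using this codepage would."""
    return raw.decode(codepage, errors="replace")


def hexdump(raw):
    """Codepage-independent view: two hex digits per byte."""
    return " ".join(f"{b:02X}" for b in raw)


if __name__ == "__main__":
    for codepage in ("cp1252", "cp437"):
        print(codepage, render(RAW, codepage))
    print("hex   ", hexdump(RAW))
```

Only the hex line stays the same for every reader, which is why the hex values are the safer thing to compare.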

Newer flash animations may also begin with "CWS".

WMA, or I guess actually ASF, files can visually be identified by the presence of "0&" followed later by "Seh". Then they may also contain metadata tags in Unicode beginning with "WM/".

The old-school MPEG formats don't have a distinct header. Instead, each packet in the stream begins with a predefined pattern of bits that carries information about the parameters of that packet. This is actually very good, since most of these MPEG streams can be cut and still be valid for immediate playback. MPEGs may be identified by repeating occurrences of this packet header.

For example, MPEG-2 PS, as on Video DVDs, can be identified by a regularly repeating sequence of "00 00 01 BA 44", which visually appears as repeating "║D" (if using a DOS/OEM codepage).
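
Looking for that repeating pattern can be sketched in a few lines of Python. This is an illustration only: pack_header_hits is an invented name, and it simply counts every "00 00 01 BA" in the data, grouped by the byte that follows, without checking that the hits are evenly spaced.

```python
from collections import Counter

PACK_START = bytes.fromhex("00 00 01 BA")


def pack_header_hits(data):
    """Count each occurrence of 00 00 01 BA, keyed by the byte that follows it."""
    hits = Counter()
    pos = data.find(PACK_START)
    while pos != -1:
        following = pos + len(PACK_START)
        if following < len(data):
            hits[data[following]] += 1
        pos = data.find(PACK_START, pos + 1)
    return hits
```

Many hits with 0x44 as the following byte would match the pattern described for DVD video above, and hits followed by 0x21 would match the "BA 21" noted for MPG in the first post. A single hit proves little, since four arbitrary bytes can occur by chance in any large file.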

Continued from the previous, each AC-3 frame starts with "0B 77".

MP3 files can't be reliably identified by the packet headers (http://mpgedit.org/mpgedit/mpeg_format/MP3Format.html). You could look for repeated "FF FB" for the most common near-CD-quality files. The most popular and industry-standard LAME encoder will write its name and version in the stream. You may also encounter the following dwords near the very start of the file: "Xing" (VBR), "Info" (CBR made by LAME), "VBRI" (VBR by Fraunhofer). MP3s tagged with the terrible ID3v2 standard start with "ID3".
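
Since no single marker is reliable, one option is to collect all the hints mentioned above and judge them together. The Python sketch below does that; mp3_hints is an invented name, and the 4096-byte window used for "near the very start" is an assumption that will be too small for files with a large ID3v2 tag.

```python
def mp3_hints(data, window=4096):
    """Collect weak indicators that the data is an MP3 stream; none is proof."""
    hints = []
    if data.startswith(b"ID3"):
        hints.append("ID3v2 tag")
    head = data[:window]  # assumed size for "near the very start"
    for tag in (b"Xing", b"Info", b"VBRI"):
        if tag in head:
            hints.append(tag.decode("ascii") + " header")
    if b"LAME" in head:
        hints.append("LAME encoder string")
    sync_count = data.count(b"\xff\xfb")
    if sync_count:
        hints.append(f"{sync_count} x FF FB")
    return hints
```

An empty list does not rule out MP3, and a lone "FF FB" can turn up in any binary file, so the more hints agree, the better the guess.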

Each packet of the MP2 data starts with "FF FD" for the most common near-CD-quality.

Each packet of DTS audio starts with "7F FE 80" for 44.1 kHz, surround. Needs additional verification for other configurations. Very important to know when extracting DTS from various containers such as SPDIF or CDDA.

Monkey's Audio (also known as APE, MAC) files start with "MAC".

Matroska media files will nicely report themselves on the first line as "matroska".

RAR archives always begin with "Rar!".

CorelDraw CDR files contain "RIFF" followed by "CDRBvrsn".

Zip. For additional verification, scroll to the end of file for a list of all archived filenames, each item also beginning with "PK".
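
The look at the end of the file can be sketched in Python as well. The three four-byte markers below are the ZIP format's local file header, central directory entry and end-of-central-directory record; the function name zip_markers is invented for this example, and counting raw byte patterns is only a rough check because compressed data could contain the same bytes by chance.

```python
LOCAL_HEADER = b"PK\x03\x04"        # starts each archived file's data
CENTRAL_HEADER = b"PK\x01\x02"      # one per filename in the list at the end
END_OF_CENTRAL_DIR = b"PK\x05\x06"  # closes the archive

# The end record is 22 bytes plus an optional comment of up to 65535 bytes.
TAIL_SIZE = 22 + 65535


def zip_markers(data):
    """Report the PK markers found at the start and the end of the data."""
    return {
        "starts_with_local_header": data.startswith(LOCAL_HEADER),
        "central_entries": data.count(CENTRAL_HEADER),
        "has_end_record": END_OF_CENTRAL_DIR in data[-TAIL_SIZE:],
    }
```

For a plain yes or no answer, Python's standard library already offers zipfile.is_zipfile, which is the more robust choice outside of a quick inspection.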

Sam_Zen
23.11.2007, 03:38 AM
@ j7n
An excellent contribution.
To add some more :
XM module tracker files start with "Extended Module: " plus track title.
IT module tracker files start with "IMPM" plus track title.
Difference between a stereo WAV and a 4 channel WAV : instead of the string "data", it's "qdata".
PK - files for a quick load of WAV files in Cool Edit. Start with "F1 06"
TER starts with "TERRAGENTERRAIN"
PDF starts with "%PDF-" plus version
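MID starts with "MThd"; a few bytes further comes "MTrk", often seen as "ÀMTrk" (C0 4D 54 72 6B). The byte before "MTrk" is the end of the header's timing value, so it can differ between files.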

A nice tool to investigate the contents of especially executables like .exe or .dll is Textscan (http://www.analogx.com/contents/download/program/textscan.htm) at AnalogX.
It can show possible ascii-strings inside a file, among the codes, for identification.
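
The general idea of pulling readable strings out of a binary file can be sketched in Python as follows. This is not how Textscan itself works, only a rough equivalent; the function name ascii_strings and the minimum length of 4 characters are assumptions for this example.

```python
import re


def ascii_strings(data, min_len=4):
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def strings_in_file(path, min_len=4):
    """Convenience wrapper that reads the whole file into memory first."""
    with open(path, "rb") as f:
        return ascii_strings(f.read(), min_len)
```

A lower minimum length finds short markers like "BM" or "PK" but also produces far more noise from random bytes that happen to be printable.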

matera
24.11.2007, 12:22 AM
Speaking of hex viewers/editors, I have found Tiny Hexer from www.mirkes.de to be the most helpful of all tools. Unlike most hex editors, it has a multi-document interface, and it is very fast and configurable. It can be used to compare two different files side by side, by tiling the windows.

I have used it to repair a file with a damaged header, copying parts from a known good JPG to a corrupted one.

Sam_Zen
26.11.2007, 03:49 AM
A file with more detailed information about some 100 formats, their headers and syntax, can be found here (http://www.xs4all.nl/~samzen/download/ffmts002.zip).

ShelteredCoder
21.12.2007, 01:25 PM
FYI:

I use a software package that has a MIME type identifier piece that works by looking at these header strings. It uses an XML file to define all the file types. Here's the link (does not link to the site of the actual code base): http://home.mchsi.com/~jloyd01/mimetypes.xml

In addition, there is a site out there that has all this info on just about any file you want: http://filext.com/fextend.php