[olug] sort -u vs uniq

William Mihalo wmihalo at gmail.com
Mon Mar 13 11:12:45 CDT 2017


I had to monitor break-in attempts at one of the national labs and used
msort to go through hundreds of IP addresses. msort is available in the
Fedora/RHEL repos.

Here's an example from https://www.linux.com/news/sorting-your-data-msort.
It sorts on the SRC field first and then on the DST field.

[wmihalo@asusdesk ~]$ cat ips.txt
Apr 29 20:14:58 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.3.4 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.4.12 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth1 OUT=eth0 SRC=192.168.3.3 DST=192.168.3.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.0.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.133 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.1.33 LEN=76...
[wmihalo@asusdesk ~]$ msort -l -t SRC= -c h -t DST= -c h ips.txt
Key 1 obligatory     tag SRC=     Increasing hybrid
Key 2 obligatory     tag DST=     Increasing hybrid
Reading from ips.txt.
Records processed:                          8
Sorting...
Records written:                            0
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.133 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.1.33 LEN=76...
Apr 29 20:14:58 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.3.4 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth1 OUT=eth0 SRC=192.168.3.3 DST=192.168.3.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.0.33 LEN=76...
Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.4.12 LEN=76...

Bill Mihalo


On Mon, Mar 13, 2017 at 10:50 AM, Matthew G. Marsh <olug4mgm at paktronix.com>
wrote:

>
> Hmmm - 2 replies in one year.
>
> Also take a look at sort -k, which sorts on "keys" defined by fields. You
> can set the field delimiter with -t and then use keys to sort on those
> fields.
>
> Thus if you define the delimiter as "." (the dot in IPv4) then you have
> four fields on which to sort AND you can sort on subsets of those fields
> (very handy for MAC addresses and IPv6 sorting).
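
A minimal sketch of the per-octet sort described above, assuming a sort that supports -t, -k and -u (GNU sort does). The addresses and the variable names ips and sorted are made up for illustration:

```shell
# Made-up sample: one IPv4 address per line, with one exact duplicate.
ips='108.78.42.145
69.38.74.12
108.78.9.1
108.78.42.145'

# -t .        use the dot as the field delimiter
# -kN,Nn      sort numerically on octet N only
# -u          keep one line per distinct combination of the four keys
sorted=$(printf '%s\n' "$ips" | sort -t . -k1,1n -k2,2n -k3,3n -k4,4n -u)
printf '%s\n' "$sorted"
```

Because -u compares only the keys given, uniqueness here is decided on all four octets rather than on a numeric prefix of the line.
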
>
> BTW FWIW - most *nix man pages are very lame regarding these features as
> they want you to use *bleep* info files. I long ago converted info to man
> and then added in the POSIX spec man pages (.p custom extensions) in case I
> needed to know the full story.
>
> HTH!
>
> mgm
>
> On Mon, 13 Mar 2017, Lou Duchez wrote:
>
>> This page might have some useful information:
>>
>> http://unix.stackexchange.com/questions/75341/specify-the-sort-order-with-lc-collate-so-lowercase-is-before-uppercase
>>
>> As to what you experienced, I know I was once surprised to see a
>> PostgreSQL statement sort data differently between a Windows server and a
>> Linux server -- it's a vague memory, but I think Linux was evaluating sort
>> order by looking for a numeric component that precedes the rest of the
>> string (so "2beornottobe" was sorting before "1234imdeclaringathumbwar"
>> because "2" is less than "1234").  Is that what Linux is doing for you?
>> With a string like "108.78.42.145", maybe Linux sees that as "108.78"
>> followed by ".42.145".
>>
>> Can you foil this nefarious behavior by sorting by a non-numeric
>> character prefixed to the IP addresses, somehow?  I bet not even "C" can
>> mis-sort "A108.78.42.145" and "A69.38.74.12".
>>
>>
>>> I'm trying to get a list of unique IP addresses from a log file. I have a
>>> list of ALL IP addresses. Using sort -nu and sort -n | uniq gives me 2
>>> different lists.
>>>
>>> A stare and compare makes me think that sort -nu only considers the
>>> first 2 octets as significant. RTFM of the sort man page indicates sort
>>> honors LC_COLLATE.
>>>
>>> <appear uninformed>
>>> LC_COLLATE isn't in env, so I'm assuming it's set at build/compile time
>>> when building sort or in the c libraries someplace?
>>> </appear uninformed -- hardly, stupid probably better tag... and not
>>> closed.>
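
On the LC_COLLATE question: as far as I know the collation locale is picked up at run time, not at build time. If LC_COLLATE is unset, LANG is used instead, LC_ALL overrides both, and with none of them set the "C" locale applies. A small sketch with made-up input (the names words and c_sorted are hypothetical) that forces the C locale for a single command:

```shell
# Made-up input with mixed case.
words='b
A
a
B'

# LC_ALL overrides LC_COLLATE and LANG, here for this one sort invocation only.
# The C locale compares bytes, so uppercase letters sort before lowercase ones.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

Running the locale command with no arguments shows which values are in effect in the current shell.
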
>>>
>>> Could this be why the sort -u and uniq return differing output? I don't
>>> see anyplace to specify "how much" to consider significant when running
>>> sort. Anyone care to offer thoughts?
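
A sketch of what is probably happening, assuming GNU sort in the C locale and made-up addresses. With -n, sort compares only the leading number on each line, so 108.78.42.145 and 108.78.9.1 both compare as 108.78. Adding -u collapses lines that compare equal, whereas uniq only drops lines that are identical:

```shell
# Made-up sample: two addresses share the first two octets, one line repeats.
ips='108.78.42.145
108.78.9.1
69.38.74.12
108.78.42.145'

# sort -nu: uniqueness is judged on the numeric prefix (108.78 vs 69.38).
count_sort_nu=$(printf '%s\n' "$ips" | LC_ALL=C sort -nu | wc -l | tr -d ' ')

# sort -n | uniq: uniq compares whole lines, so only the exact duplicate goes.
count_sort_uniq=$(printf '%s\n' "$ips" | LC_ALL=C sort -n | uniq | wc -l | tr -d ' ')

echo "sort -nu:       $count_sort_nu lines"
echo "sort -n | uniq: $count_sort_uniq lines"
```

Dropping -n (plain sort -u) or sorting on per-octet keys should make the two approaches agree.
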
>>>
>>> Thanks.
>>>
>>>
>>> Noel
>>>
>>> _______________________________________________
>>> OLUG mailing list
>>> OLUG at olug.org
>>> https://lists.olug.org/mailman/listinfo/olug
>>>
> --------------------------------------------------
> Matthew G. Marsh
> Special Email Addr for OLUG ;-}
> Phone: (402) 932-7250
> Email: olug4mgm at paktronix.com
> WWW:  http://www.paksecured.org
> --------------------------------------------------
>

