Python Average Tabular Data Help

July 20, 2022 Post a Comment

Ok I have the following working program. It opens of a file of data in columns that is too large for excel and finds the average value for each column: Sample data is: Joe Sam Bob

Solution 1:

Here is a functional solution:

text = """Joe Sam Bob
1   2   3
2   1   3
NA 2 3
3 5 NA"""

def avg( lst ):
    """ returns the average of a list """
    return 1. * sum(lst)/len(lst)

# split that text
parts = [line.split() for line in text.splitlines()]
#remove the headers
names = parts.pop(0)
# zip(*m) does something like transpose a matrix :-)
columns = zip(*parts)
# convert to numbers and leave out the NA
numbers = [[int(x) for x in column if x != 'NA' ] for column in columns]
# all left is averaging
averages = [avg(col) for col in numbers]
# and printing
for name, x in zip( names, averages):
    print name, x

I wrote a lot of list comprehensions here so you can print out intermediate steps, but those can be generators of cause.

Solution 2:

[edited for clarity]

When reading items from a text file, they are imported as strings, not numbers. This means that if your text file has the number 3 and you read it into Python, you would need to convert the string to a number before carrying on arithmetic operations.

Now, you have a text file with colums. Each column has a header and a collection of items. Each item is either a number or not. If it is a number, it will correctly be converted by the function float, if it is not a valid number (this is, if the conversion does not exist) the conversion will raise an exception called ValueError.

So you loop through your list and items as it has been correctly explained in more than one answer. If you can convert to float, accumulate the statistic. If not, go on ignoring that entry.

If you need more info about what is "duck typing" (a paradigm which can be resumed as "better to ask for forgiveness that for permission") please check the Wikipedia link. If you are getting into Python you will hear the term very often.

Below I present a class which can accumulate an statistic (you are interested in the mean). You can use an instance of that class for every column in your table.

class Accumulator(object):
    """
    Used to accumulate the arithmetic mean of a stream of
    numbers. This implementation does not allow to remove items
    already accumulated, but it could easily be modified to do
    so. also, other statistics could be accumulated.
    """
    def __init__(self):
     # upon initialization, the numnber of items currently
     # accumulated (_n) and the total sum of the items acumulated
     # (_sum) are set to zero because nothing has been accumulated
     # yet.
     self._n = 0
     self._sum = 0.0

    def add(self, item):
     # the 'add' is used to add an item to this accumulator
     try:
        # try to convert the item to a float. If you are
        # successful, add the float to the current sum and
        # increase the number of accumulated items
        self._sum += float(item)
        self._n += 1
     except ValueError:
        # if you fail to convert the item to a float, simply
        # ignore the exception (pass on it and do nothing)
        pass

    @property
    def mean(self):
     # the property 'mean' returns the current mean accumulated in
     # the object
     if self._n > 0:
        # if you have more than zero items accumulated, then return
        # their artithmetic average
        return self._sum / self._n
     else:
        # if you have no items accumulated, return None (you could
        # also raise an exception)
        return None

# using the object:

# Create an instance of the object "Accumulator"
my_accumulator = Accumulator()
print my_accumulator.mean
# prints None because there are no items accumulated

# add one (a number)
my_accumulator.add(1)
print my_accumulator.mean
# prints 1.0

# add two (a string - it will be converted to a float)
my_accumulator.add('2')
print my_accumulator.mean
# prints 1.5

# add a 'NA' (will be ignored because it cannot be converted to float)
my_accumulator.add('NA')
print my_accumulator.mean
# prints 1.5 (notice that it ignored the 'NA')

Cheers.

Solution 3:

The following code handles varying counts properly, and also detects extra data ... in other words, it's rather robust. It could be improved by explicit messages (1) if the file is empty (2) if the header line is empty. Another possibility is testing explicitly for "NA", and issuing an error message if a field is neither "NA" nor floatable.

>>> import sys, StringIO
>>>
>>> data = """\
... Jim Joe Billy Bob
... 1   2   3     x
... 2   x   x     x  666
...
... 3   4   5     x
... """
>>>
>>> def get_averages(f):
...     headers = f.readline().split()
...     ncols = len(headers)
...     sumx0 = [0] * ncols
...     sumx1 = [0.0] * ncols
...     lino = 1
...     for line in f:
...         lino += 1
...         values = line.split()
...         for colindex, x in enumerate(values):
...             if colindex >= ncols:
...                 print >> sys.stderr, "Extra data %r in row %d, column %d" %
(x, lino, colindex+1)
...                 continue
...             try:
...                 value = float(x)
...             except ValueError:
...                 continue
...             sumx0[colindex] += 1
...             sumx1[colindex] += value
...     print headers
...     print sumx1
...     print sumx0
...     averages = [
...         total / count if count else None
...         for total, count in zip(sumx1, sumx0)
...         ]
...     print averages

Edit add here:

...     return headers, averages

...
>>> sio = StringIO.StringIO(data)
>>> get_averages(sio)
Extra data '666' in row 3, column 5
['Jim', 'Joe', 'Billy', 'Bob']
[6.0, 6.0, 8.0, 0.0]
[3, 2, 2, 0]
[2.0, 3.0, 4.0, None]
>>>

Edit

Normal usage:

with open('myfile.text') as mf:
   hdrs, avgs = get_averages(mf)

Solution 4:

Change your inner-most loop to:

    values = line.split(" ")
    for i in xrange(len(values)):
        if values[i] == "NA":
            continue
        sums[i] += int(values[i])
    numRows += 1

Solution 5:

Much smaller code:

with open('in', "rtU") as f:
    lines = [l for l in f if l.strip()]
    names = '\t'.join(lines[0].split())
    numbers = [[i.strip() for i in line.split()] for line in lines[1:]]
    person_data = zip(*numbers)
    person_data = [tuple(int(i) for i in t if i!="NA") for t in person_data]
    averages = map(lambda x: str(float(sum(x))/len(x)), person_data)

with open('out', 'w') as f:
    f.write(names)
    f.write('\n')
    f.write('\t'.join(averages))

I have tested this after John Machin left his comment. In response to his comments:

This was a bug that existed because I misread the problem. It has been fixed
I have tried to make that line a little more readable now, but to be honest, I don't understand why you called it obfuscatory in the first place
You have pointed out a logic error in my code. I guess I really shouldn't have been doing this in the middle of class... for this I apologize
I agree that readlines() was redundant. I didn't have a proper python interpreter to cross check this, so I left it in as a safety

Hope this is better.

Python Manual