I have a dataset like the one below:
"birth_date_1:25 birth_date_2:august birth_date_3:1945 birth_place_1:france death_date: "
"birth_date_1:14 birth_date_2:june birth_date_3:1995 birth_place_1:dvůr birth_place_2:králové birth_place_3:nad birth_place_4:labem birth_place_5:, birth_place_6:czech birth_place_7:republic "
"birth_date_1:21 birth_date_2:february birth_date_3:1869 birth_place_1:blackburn birth_place_2:, birth_place_3:england death_date_1:12 death_date_2:march death_date_3:1917 "
"birth_date_1:07 birth_date_2:july birth_date_3:1979 birth_place_1:ghana birth_place_2:, birth_place_3:accra "
"birth_date_1:27 birth_date_2:february birth_date_3:1979 birth_place_1:durban birth_place_2:, birth_place_3:south birth_place_4:africa "
"birth_date_1:1989 birth_place_1:lima birth_place_2:, birth_place_3:peru "
"birth_date_1:5 birth_date_2:september birth_date_3:1980 birth_place_1:angola death_date: "
"birth_date_1:1 birth_date_2:february birth_date_3:1856 birth_place_1:hampstead birth_place_2:, birth_place_3:london death_date_1:14 death_date_2:august death_date_3:1905 "
"birth_date_1:28 birth_date_2:december birth_date_3:1954 birth_place_1:hickory birth_place_2:, birth_place_3:north birth_place_4:carolina death_date: "
"birth_date: "
"birth_date: birth_place: death_date: "
"birth_date: birth_place_1:belfast birth_place_2:, birth_place_3:northern birth_place_4:ireland "
"birth_date: birth_place: death_date: "
"birth_date_1:28 birth_date_2:february birth_date_3:1891 birth_place_1:carberry birth_place_2:, birth_place_3:manitoba death_date_1:20 death_date_2:september death_date_3:1968 "
"birth_date_1:4 birth_date_2:november birth_date_3:1993 birth_place_1:portimão birth_place_2:, birth_place_3:portugal "
From this dataset, I am trying to extract the birth date, birth place, and death date of each record, tab-separated, as below:
25.08.1945 \t France \t NA
14.06.1995 \t Dvůr Králové nad Labem,Czech Republic \t NA
21.02.1869 \t Blackburn,England \t 12.03.1917
.
.
.
1989 \t Lima,Peru \t NA
.
.
.
NA \t NA \t NA
NA \t NA \t NA
NA \t Belfast, Northern Ireland \t NA
.
.
04.11.1993 \t Portimão,Portugal \t NA
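To make the mapping from a record to these columns concrete, here is a minimal sketch of the first two steps: grouping the key:value tokens by field name, and joining the place tokens. The names TOKEN_RE, parse_record and format_place are hypothetical helpers, not part of my code. The sketch assumes tokens are separated by whitespace, that the numbered tokens already appear in index order, and that a comma should be followed by a space. It also capitalizes every word, so "nad" would come out as "Nad".

```python
import re
from collections import defaultdict

# Matches "field:value" or "field_N:value"; the value may be empty.
TOKEN_RE = re.compile(r"^([a-z_]+?)(?:_(\d+))?:(.*)$")


def parse_record(line):
    """Group the non-empty values of one record by field name."""
    fields = defaultdict(list)
    for token in line.strip().strip('"').split():
        match = TOKEN_RE.match(token)
        if match is None:
            continue  # skip anything that is not a key:value token
        name, _index, value = match.groups()
        if value:  # "birth_date:" carries no value, so nothing is stored
            fields[name].append(value)
    return dict(fields)


def format_place(tokens):
    """Join place tokens, attaching each comma to the preceding word."""
    parts = []
    for token in tokens:
        if token == ",":
            if parts:
                parts[-1] += ","
        else:
            parts.append(token.title())
    return " ".join(parts) if parts else "NA"


record = parse_record(
    '"birth_date: birth_place_1:belfast birth_place_2:, '
    'birth_place_3:northern birth_place_4:ireland "'
)
print(format_place(record.get("birth_place", [])))
```

A field that is missing or has an empty value simply does not appear in the returned dict, so `record.get(field, [])` covers both cases and the formatter falls back to "NA".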
I wrote the code below to achieve this. However, my dataset contains several scenarios I have to handle: for example, birth_date_1 can be empty, a month name, or a year. Because of that, the loop I came up with feels like it is going to fail somewhere and won't be feasible.
import numpy as np

inputfile = open('ornek_box_seperated.csv', 'r', encoding="utf-8")
outputfile = open('ornek_box_seperated_update.csv', 'w', encoding="utf-8")

# one row per record, 9 slots for the date and place parts
birthDatePlace = [[np.nan for i in range(9)] for j in range(20000)]

for row, line in enumerate(inputfile):
    # split the record into "key:value" tokens, e.g. "birth_date_1:25"
    d = line.strip().strip('"').split()
    print(d)
    for token in d:
        key, _, value = token.partition(":")
        if not key.startswith("birth_date"):
            continue
        if value.isdigit() and 0 < int(value) < 40:
            birthDatePlace[row][0] = value   # day
        elif value.isdigit() and int(value) < 2020:
            birthDatePlace[row][2] = value   # year
        elif value:
            birthDatePlace[row][1] = value   # month name
    # this code is planned to continue from here until it covers all the
    # birth place and death date information in the required format
    outputfile.write("\t".join(str(v) for v in birthDatePlace[row]))
    outputfile.write('\n')

inputfile.close()
outputfile.close()
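One alternative to the numeric thresholds might be to classify each date token by what it looks like rather than by its position. This is only a sketch under my own assumptions: years always have four digits, days are 1 to 31, month names are in English, and a day is only printed when a month is also known. MONTHS and format_date are hypothetical names.

```python
MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Map each month name to its number, e.g. "august" -> 8.
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def format_date(tokens):
    """Build DD.MM.YYYY (or a partial date) from date tokens, else "NA"."""
    day = month = year = None
    for token in tokens:
        lowered = token.lower()
        if lowered in MONTHS:
            month = MONTHS[lowered]
        elif token.isdigit() and len(token) == 4:
            year = token  # assumption: a four-digit number is a year
        elif token.isdigit() and 1 <= int(token) <= 31:
            day = int(token)
    parts = []
    if day is not None and month is not None:
        parts.append(f"{day:02d}")  # a day without a month is dropped
    if month is not None:
        parts.append(f"{month:02d}")
    if year is not None:
        parts.append(year)
    return ".".join(parts) if parts else "NA"


print(format_date(["25", "august", "1945"]))  # 25.08.1945
print(format_date(["1989"]))                  # 1989
print(format_date([]))                        # NA
```

With this approach it no longer matters whether birth_date_1 holds a day, a month name, or a year, because no token is interpreted by its index.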
I would appreciate any help you can provide. I am fairly new to Python, and especially to regex and string extraction techniques.
Thank you in advance for your kind support.