Freenode #389 channel log, 19 August 2015
Firstyear
Need to isolate some changelog and replication performance issues. Any pointers where to start?
I've already looked at errors / access, checked some basic settings, and made sure that cn=monitor was okay
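For reference, the cn=monitor check mentioned above can be done with a plain base search; the bind DN, password, and choice of counters below are placeholders, not from the conversation:
# placeholder credentials; pick a few counters of interest
ldapsearch -D "cn=directory manager" -w xxx -b "cn=monitor" -s base "(objectclass=*)" currentconnections opsinitiated opscompleted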
tbordaz
Firstyear: it depends on what the performance issue is (bad etime, CPU spike, consumer lagging behind the master...)
Firstyear
slow writes and slow replication
cpu is low, etime is low
No issues with reads either.
Quite a lot of messages about repl5_inc_result_threadmain: read result for message_id 2809
tbordaz
Firstyear: What do you mean by slow? Is the consumer getting further and further behind the supplier?
Firstyear
Yes.
tbordaz
What version is it ?
Firstyear
It does "eventually" replicate. But it's abnormally slow to replicate, and *very* slow to write.
389-ds-base-1.2.11.15-50.el6_6.x86_64
5 seconds to write compared to near instant, for example. So I'm wondering where the best place to investigate is.
tbordaz
did you monitor the disk activity and correlate it with specific DS files?
Firstyear
Nope, but I can certainly do that.
iostat is showing only 145 writes / s, and an await of 0.24 milliseconds
So in other words, low disk io
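Numbers like those usually come from extended iostat output; a minimal sketch, with the interval and device as assumptions:
# -x gives extended stats (w/s, await); 5-second interval, example device
iostat -x 5 /dev/sda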
tbordaz
can you get some pstacks on the consumer (the one that is late)
and also a couple of pstacks from the supplier
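A pstack can be taken against the running ns-slapd process; a sketch, assuming a single instance per host:
# take a few snapshots a couple of seconds apart on each host
pstack $(pidof ns-slapd) > /tmp/pstack.$(hostname).$(date +%s)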
Firstyear
Probably, but it will take me a second. This is a production system. It's not critical that it's going a little slow, but I want to know how to diagnose it
tbordaz
in conclusion: consumer CPU low, disk low, and rapid updates
Firstyear
Sounds about right.
tbordaz
at first glance it would look like the master is not sending the updates fast enough
Firstyear
Hmm okay
tbordaz
what about disk/cpu on the master ?
Firstyear
Low as well.
I checked both
tbordaz
"slow writes and slow replication" do you mean it takes too much time for an update to go from Master->consumer ? How long ?
Firstyear
Writes are about 5 to 10 times slower than normal, and I can see incremental replications start and they take about 20 seconds, compared to our normal 1 or 2 seconds.
I've seen some changes take up to a few minutes to get sent across from master to consumer.
Reads are unaffected.
Anyway, I'll get some pstacks and take a look. I was hoping there might be some better tools that I didn't know about before I went for the debugger.
tbordaz
What are the updates: LDAP_MOD? MOD_ADD/MOD_DEL/MOD_REPL?
Are you updating different entries or always the same set of them?
Firstyear
It's mostly mods, mainly group members. Different groups every time, different members. I was suspicious of referential integrity.
Again, that's why I wanted to know if there were some tracing or other stats / tools I could look at.
Anyway, about to take the pstack
Okay, got them. Will go through them now.
The consumer seems to only have one thread "writing", and that's doing an index update. Nothing major there. The supplier appears to have quite a number of threads in ldbm_back_modify(), and a couple in index updates, so I assume it's just database contention
Is there a way to show number of writers / locks on a bdb file?
db_stat perhaps?
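db_stat can indeed report on the BDB lock region; a sketch, assuming the default el6 instance layout (INSTANCE is a placeholder):
# -c prints lock statistics (lockers, locks, conflicts) for the DB environment
db_stat -c -h /var/lib/dirsrv/slapd-INSTANCE/db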
tbordaz
with 'referential integrity' you should see high CPU and likely high IO
Could you do the following search on one of the problematic groups:
ldapsearch -D "cn=directory manager" -w xxx -b "<group_dn>" -s base nscpentrywsi | wc
Firstyear
Will do
tbordaz
If there are several threads (on the master) in back_modify, you should see some etime>0 on the master. No?
Firstyear
etime on reads is still low
Wow. What is this nscpentrywsi magic.
I'll look for some mods etime
Yeah, normally the modify etime is 0, but sometimes it spikes to 30
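One quick way to find those spikes is to pull the non-zero etimes out of the access log; the instance path is a placeholder:
# RESULT lines carry the etime; filter out the common etime=0 case
grep RESULT /var/log/dirsrv/slapd-INSTANCE/access | grep -v "etime=0"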
tbordaz
Firstyear: it is a way to dump the LDIF format of the entry (with all its attributes + replication info)
I would like to check how large your entry is
Firstyear
This one with the high etime is 325 attributes.
tbordaz
325 attributes or attribute values?
Firstyear
attribute values
tbordaz
according to 'wc', how large is the entry (using the nscpentrywsi dump)?
Firstyear
249 of them are uniqueMember, and about 30 of those are uniqueMembers being deleted.
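Counting the values directly is another way to size the group, without the nscpentrywsi dump; the credentials and group DN are placeholders, mirroring the search shown earlier:
# continuation lines start with a space, so this counts one line per value
ldapsearch -D "cn=directory manager" -w xxx -b "<group_dn>" -s base uniqueMember | grep -ci "^uniquemember:"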
tbordaz
Did you change something in indexing ?
Firstyear
Nope
No changes for months.
Only started about 24 hours ago.
I see a lot of messages like repl5_inc_result_threadmain: read result for message_id
tbordaz
Well, my understanding is that starting 24h ago, you started to see spiking (30s) updates on the master where it used to be 0s before. Correct?
Firstyear
Correct
tbordaz
So maybe we have to focus on why it is high (now) on the master.
Having some replication delay can be a consequence of that.
Firstyear
Yeah
tbordaz
could you pastebin one of the master pstack ?
Firstyear
Sure.
tbordaz
What version of OS ?
Firstyear
el 6.6
http://fpaste.org/256604/
A lot of the high etimes are on large groups
So I think I'm coming back to the idea that it's just a function of refint and memberof updates.
It's just been an interesting process to get to this
tbordaz
All write threads are waiting for one thread that is indexing (likely the membership attribute). This one appears to be sorting values...
Firstyear
Besides pstack, is there a way to profile connections to see which plugins they have been through? Some logging param?
Yeah, I suspected so
Is indexing a single threaded op?
Sorry, as in, only one thread can update an index at a time I assume?
And if that's the case, I see sometimes user binds that fail are taking a long time to write their passwordRetryCount value, would the index write be blocking the write to the user?
I guess, are writes able to be multithreaded?
I'm not completely familiar with that part of the codebase.
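On the logging-param question: the error log verbosity can be raised via nsslapd-errorlog-level to include replication and plugin debugging. The level values below (8192 replication, 65536 plugin) are from memory and worth double-checking, and the output is very chatty, so it is only worth enabling briefly:
ldapmodify -D "cn=directory manager" -w xxx <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 73728
EOF
# 73728 = 8192 (replication) + 65536 (plugin); set it back to 0 afterwards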
elkris
Firstyear: writes are multi-threaded, but serialized at some point by the backend lock
tbordaz
Would it be possible for you to install the 389-ds debuginfo and redo the pstack?
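On el6 the debuginfo usually comes via yum-utils' debuginfo-install, assuming the debuginfo repo is available:
debuginfo-install 389-ds-base
# then re-run pstack so the frames resolve to symbols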
elkris
tbordaz: are you talking about the pstack you just sent to me?
tbordaz
elkris: I am talking about http://fpaste.org/256604/
elkris
tbordaz: yes, a thread doing indexing blocks the other threads from accessing the backend
tbordaz
elkris: I agree
elkris
I suppose the mod modifies a large group and there is a substring index for members
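Whether uniquemember carries a substring index can be read from the backend index config; 'userRoot' below is the typical default backend name and may differ, and the credentials are placeholders:
# nsIndexType values of interest: pres, eq, sub
ldapsearch -D "cn=directory manager" -w xxx -b "cn=uniquemember,cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config" -s base "(objectclass=*)" nsIndexType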
tbordaz
elkris: right, but what is surprising is that the same MODs were rapid 24h ago
elkris: and the groups are not that large (~300 members)
Firstyear: yesterday the MODs were rapid (0s) correct ?
Firstyear: same MODs (update of group members) ?
elkris
tbordaz: the same mod ? can you be sure ?
tbordaz
elkris: no :(
Firstyear: ^^^
elkris
tbordaz: from the stack, qsort is taking the time, so either qsort changed, or the group grew in this time, or it is a different group or mod
Firstyear
I am not sure whether the mods yesterday were to large groups, no
I've been tracing logs / conns and narrowing it down.
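logconv.pl, shipped with 389-ds-base, is handy for that kind of narrowing; it summarizes operation counts and the slowest operations across the access logs. The log path is a placeholder:
logconv.pl /var/log/dirsrv/slapd-INSTANCE/access*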