Removing conditional early return from engine_map() function
in case of insufficient free cachelines. The reasons are:
1. current implementation does not treat unssufficient free
cachelines condition as an error,
2. the check is based on stale request info, so it is inaccurate,
3. it is easier to hit more paths with functional tests,
4. partially mapping request from the freelist becomes more common
rather than being a corner case dependent on racy timings between
threads
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Move ref counts to their own cacheline - otherwise they pollute and cause
false sharing to fields nearby and cause a lot of cache bouncing between
physical CPUs.
Signed-off-by: Kozlowski Mateusz <mateusz.kozlowski@intel.com>
Grouping static fields together, while often changing ones get their own
cacheline, or some not used often/important fields.
Signed-off-by: Kozlowski Mateusz <mateusz.kozlowski@intel.com>
set_hot() depends on cacheline metadata status to determine
on which list the element is located (dirty cs clean list).
Thus at least hash bucket lock is required when calling
set_hot().
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
When issuing discard request over 512KiB OCF would trim this request and
overwrite req->core_line_count which would then cause this request to be
freed from wrong mpool.
This is fixed now by saving core_line_count that was set when allocating
this request that is never overwritten. This alloc_core_line_count is
then used to free the request from correct mpool.
Signed-off-by: Jan Musial <jan.musial@intel.com>
If user thread is preempted during tree/list update and another IO
is issued on the same CPU, the structure will be in undefined state.
This may result in hung tasks, if the tree stops being a tree and a loop exists -
tree search functions won't be able to end properly; or panics if a NULL value appears
suddenly in the preempted thread, after a null-check was already done.
Signed-off-by: Kozlowski Mateusz <mateusz.kozlowski@intel.com>
When allocation of request with map fails, we fallback to allocating
request with no map, and then allocate map separately. During request
put we need to distinguish between those two cases in order to deallocate
request properly.
Signed-off-by: Robert Baldyga <robert.baldyga@intel.com>
Early return from engine_map() in case of insufficient free
cachelines on the freelist is opportunistic, as both request
map info and freelist count are not accurate. Map info is stale
as it is to be refreshed in engine_map() after hash bucket
lock had been upgraded. Freelist count on other hand is subject
to change asynchronously.
The implementation assumption however is that after engine_map()
request is fully traversed (engine_map() is equivalent to
engine_lookup() followed by an attempt to map missing cachelines).
So in case of early return we must take care of repeating the
lookup.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
At this point cacheline status in request map is stale,
as lookup was performed before upgrading hash bucket lock.
If indeed all cachelines are mapped, this will be determined
in the main loop of engine_map().
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
This assures that cacheline with LOOKUP_INSERTED status
is always present on the LRU list.
This fixes an ENV_BUG() caused by an attempt to remove
a cacheline from LRU list which was not there. This
happened when cacheline was mapped from freelist
(LOOKUP_INSERTED) but the entire request mapping failed
and generic cleanup routines attempted to invalidate cacheline,
including removing it from the LRU list. As engine_set_hot()
is called after successfull mapping, the inserted cacheline was
not yet present on the LRU list.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Number of cachelines to evcit can't be greater than the number of unmapped
entries in request.
Signed-off-by: Michal Mielewczyk <michal.mielewczyk@intel.com>
Split traversation into two distinct phases: lookup()
and lru set_hot(). prepare_cachelines() now only calls
set_hot() once after lookup and insert are finished.
lookup() is called explicitly only once in
prepare_cachelines() at the very beginning of the
procedure. If request is a miss then then map()
performs operations equivalent to lookup() supplemented
by an attempt to map cachelines. Both lookup() and
set_hot() are called via traverse() from the engines
which do not attempt mapping, and thus do not call
prepare_clines().
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Unmapping cachelines previously mapped from freelist before
eviction is a waste of resources. Also if map does not erarly
exit upon first mapping error, we can have request fully
traversed (and partially mapped) after mapping and thus
skip lookup in eviction.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Eviction changes allowing to evict (remap) cachelines while
holding hash bucket write lock instead of global metadata
write lock.
As eviction (replacement) is now tightly coupled with request,
each request uses eviction size equal to number of its
unmapped cachelines.
Evicting without global metadata write lock is possible
thanks to the fact that remaping is always performed
while exclusively holding cacheline (read or write) lock.
So for a cacheline on LRU list we acquire cacheline lock,
safely resolve hash and consequently write-lock hash bucket.
Since cacheline lock is acquired under hash bucket (everywhere
except for new eviction implementation), we are certain that
noone acquires cacheline lock behind our back. Concurrent
eviction threads are eliminated by holding eviction list
lock for the duration of critial locking operations.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
.. so that the main part, responsible strictly for mapping
given LBA to given collision index, is encapsulated in
a function ocf_map_cache_line with external linkage.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Changing sequential request detection so that a miss request is
recognized as sequential after needed cachelines are evicted
and mapped to the request in a sequential order.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
.. to make it clean that true means cleaner must lock
cachelines rather than the lock is already being held.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Allowing request cacheline lock to be called on partially
locked request. This is going to be usefull for upcomming
eviction improvements, where request will first have evicted
(LOOKUP_REMAPPED) cachelines assigned to it in a locked state,
followed by standard request cacheline lock call in order to
lock previously inserted (LOOKUP_HIT) or mapped from freelist
(LOOKUP_INSERTED) cachelines.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Function returns true if cacheline is locked (read
or write) by exactly one entity with no waiters.
This is usefull for eviction. Assuming caller holds
hash bucket write lock, having exlusive cacheline
lock (either read or write) allows holder to remap
cacheline safely. Typically during eviction hash
bucket is unknown until resolved under cacheline lock,
so locking cacheline exclusively (instead of locking
and checking for exclusive lock) is not possible.
More specifically this is the flow for synchronizing
cacheline remap using ocf_cache_line_is_locked_exclusively:
1. acquire a cacheline (read or write) lock
2. resolve hash bucket
3. write-lock hash bucket
4. verify cacheline lock is exclusive
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Reformat function that calculates how long cache/core is dirty
Update `dirty_for` types in functional tests
Values stored in info structs fields (both in cache and core structs)
are unsigned 64-bits ints but `dirty_for`s were unsigned 32-bits ints.
Use existing function to transform returned value to seconds.
Replace seconds stored in metadata with seconds.
Replacement was done if old value of replaced field was equal to zero.
Acquiring monotonic high precission timestamp is potentially
slow and it makes sense to compare the field's value
to zero before calling atomic function.
Signed-off-by: Slawomir Jankowski <slawomir.jankowski@intel.com>
Provide number of cachelines as the cacheline concurrency
construtor param instead of reading it from cache.
The purpose of this change is to improve testability.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Cacheline concurrency functions have their interface changed
so that the cacheline concurrency private context is
explicitly on the parameter list, rather than being taken
from cache->device->concurrency.cache_line.
Cache pointer is no longer provided as a parameter to these
functions. Cacheline concurrency context now has a pointer
to cache structure (for logging purposes only).
The purpose of this change is to facilitate unit testing.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
The main purpose of cacheline concurrency global lock
is to eliminate the possibility of deadlocks when
locking multiple cachelines.
Cacheline lock fast path does not need to acquire
this lock, as it is only opportunistically attempting
to lock all clines without wait. There is no risk
of deadlock, as:
* concurrent fast path will also only try_lock
cachelines, releasing all acquired locks if failed
to immediately acquire lock for any cacheline
* concurrent slow path is guaranteed to have
precedence in lock acquisition when conditions
for deadlock occure (both slowpath and fastpath
have acquired some locks required by the other
thread). This is because the fastpath thread will
back off (release acquired locks) if any one of the
cacheline locks is not acquired.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
If an ioclass is pinned but it exceeded its occupancy limit, it should be
evicted anyway.
Signed-off-by: Michal Mielewczyk <michal.mielewczyk@intel.com>
.. in order to move primitives intended to be accessed
concurrently in separate CPU cache line.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Divide single global lock instance into 4 to reduce contention
in multiple read-locks scenario.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
1. new abbreviated previx: ocf_hb (HB stands for hash bucket)
2. clear distinction between functions requiring caller to
hold metadata shared global lock ("naked") vs the ones
which acquire global lock on its own ("prot" for protected)
3. clear distinction between hash bucket locking functions
accepting hash bucket id ("id"), core line and lba ("cline")
and entire request ("req").
Resulting naming scheme:
ocf_hb_(id/cline/req)_(prot/naked)_(lock/unlock/trylock)_(rd/wr)
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Top 50% least recently used cachelines are not promoted
to list head upon access. Only after cacheline drops to
bottom 50% it is considered as a candidate to promote
to list head.
The purpose of this change is to reduce overhead of
LRU list maintanance for hot cachelines.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Overflown partitions now have precedence over others during
eviction, regardless of IO class priorities.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Removing the logic for oportunistic partition overflow
reduction by evicting more cachelines than actually
required by the request being serviced.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
If request is hit, simply try to acquire cachelines instead of verifying
whether target partition's size is not exceeded.
Signed-off-by: Michal Mielewczyk <michal.mielewczyk@intel.com>
If there is any dirty data on the cache associated with removed core,
we must flush collision metadata after removing core to make metadata
persistent in case of dirty shutdown.
This fixes the problem when recovery procedure erroneously interprets
cache lines that belonged to removed core as valid ones.
This also fixes the problem, when after removing core containing dirty
data another core is added, and then recovery procedure following dirty
shutdown assigns cache lines from removed core to the new one, effectively
leading to data corruption.
Signed-off-by: Robert Baldyga <robert.baldyga@intel.com>
Min and max values, keept as an explicit number of cachelines, are tightly
coupled with particular cache. This might lead to errors and mismatches after
reattaching cache of different size.
To prevent those errors, min and max should be calculated dynamically.
Signed-off-by: Michal Mielewczyk <michal.mielewczyk@intel.com>
Since the request carries an explicit information about number of the
cacheliens to be reparted, no need of keeping the boolean information if some
of the request's cachelines are assigned to a wrong partition
Signed-off-by: Michal Mielewczyk <michal.mielewczyk@intel.com>
Instead of redunant calculating number of cachlines to be reparted, keep this
information in request's info
Signed-off-by: Michal Mielewczyk <michal.mielewczyk@intel.com>
If partition's occupancy limit is reached, cachelines should be evicted from
request's target partition.
Information whether particular partition eviction should be triggered is
carried as a flag by request which triggered eviction.
Signed-off-by: Michal Mielewczyk <michal.mielewczyk@intel.com>
Moving metadata implementation out of obsolete metadata_hash.c
to .c files corresponding to function declaration header files.
This requires adding shared header for metadata implementation
metadata_internal.h. Some metadata header files did not have
a corresponding .c file - in this case it is added in this
commit.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Metadata wrapper functions (calling iface->func) in header
files are changed to be declarations only. Hash interface
implementation functions in metadata_hash.c are given an
external linkage and are renamed to drop "hash" prefix.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Locks acquired in ocf_metadata_flush(/load)_all are
acquired only for the duration of queueing asynch
service for flush/load, no actual metadata accesses
are performed there.
Also flush/load all are always performed with metadata
marked as deinitialized (metadata reference counter freezed),
so no I/O is reading nor writing the metadata. The only source
of potential concurrent metadata accesses are other management
operations, which should be synchronized using management lock.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Initializing each stream with unique LBA ensures there are no initial
rbtree collisions, and thus helps to avoid clustering of all the streams
into one big linked list instead of forming performance friendly proper
tree structure.
Signed-off-by: Robert Baldyga <robert.baldyga@intel.com>
After writing metadata configuration to disk we must
send a flush request to make sure configuration sections
are commited to non-volatile storage.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Load properties before checking memory needs and obtain cache line size
from context rather than from cache state.
Signed-off-by: Rafal Stefanowski <rafal.stefanowski@intel.com>
Rather then passing whole structs, supply
_ocf_mngt_calculate_ram_needed() with just the values it actually uses.
Signed-off-by: Rafal Stefanowski <rafal.stefanowski@intel.com>
During recovery procedure there is no guarantee that checksums
of runtime sections were flushed correctly before dirty shutdown.
Signed-off-by: Robert Baldyga <robert.baldyga@intel.com>
WA write must follow follow the same two-pass pattern
as WI does. This change modifies WA engine to default to
WI in case of any miss (either partial or full), not only
partial miss.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Add second pass of write invalidate. It is necessary only
if concurrent I/O had inserted target LBAs to cache after
WI request did traversation. These LBAs might have been
written by WI request behind the concurrent I/O's back,
resulting in making these sectors effectively invalid.
In this case we must update these sectors' metadata to
reflect this. However we won't know about this after we
traverse the request again - hence calling ocf_write_wi
again with req->wi_second_pass set to indicate that this
is the second pass (core write should be skipped).
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
core_id should be set in this function. The fact that
it is missing might lead to incorrect behaviour e.g. in
case of promotion policy.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
In case of partial hit WO engine first reads data for the entire
request address range from core device. Then it plumbs it by fetching
dirty sectors from cache device.
For unindentified reason this leads to a data corruption in YCSB
workload A. After flushing dirty data and re-loading cache the
data is correct.
This change modifies WO read handler to read clean data from the
cache. This is not optimal, as the clean sectors are now read twice
in case of partial hit. For now it seems to be good enough work-around
for the data corruption problem.
The symptoms, combined with the fact that this change seems to make
the problem go away, indicates that at some point WB write handler
(and/or special I/O request handlers like discard) puts CAS in a
state where in-memory medata wrongly indicates that a sector is
clean while in fact it is dirty, as marked in the on-disk metadata.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Fail `ocf_mngt_cache_load` function with `OCF_ERR_INVAL`
error code when force flag is in use.
Log error message.
Closes#361
Signed-off-by: Slawomir Jankowski <slawomir.jankowski@intel.com>
ocf_engine_push_req_(front|back) must not dereference req
pointer after putting the request on queue list and unlocking
the queue. At this point handler interface may asynchronously
pick up the request, handle it and deallocate.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
There is no need to constantly hold metadata global lock
during collecting cachelines to clean. Since we are past
freezing dirty request counter, we know for sure that the
number of dirty cache lines will not increase. So the worst
case is when the loop realxes and releases the lock,
a concurent IO to CAS is serviced in WT mode, possibly
inserting and/or evicting cachelines. This does not interfere
with scanning for dirty cachelines. And the lower layer will
handle synchronization with concurrent I/O by acquiring
asynchronous read lock on each cleaned cacheline.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
In current implementation in case of fast media flushning
container may starve all concurrent containers flushing
due to continous rescheduling of offender requests to the
front of I/O queue. Pushing request to the back of IO
queue ensures FIFO handling and removes possibility of
starvation.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Lower layer is prepared to handle used cachelines by
acquiring asynchronus read lock. It is very likely that
by the time the cacheline is actually cleaned its lock
state changes. So checking the lock at the moment of
constructing dirty cachelines list makes little sense.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
Moving _ocf_mngt_flush_containers outside global metadata
critical section. All this function does is sort core lines
and add queue request.
This fixes stalls reported by Linux scheduler due to
IO threads waiting on global metadata RW semaprhore for
several minutes.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
a_req is allocated using env_vmalloc() so we need to free it
using env_vfree(), not env_free().
Signed-off-by: Robert Baldyga <robert.baldyga@intel.com>
Discard handling splits large request into several steps.
However the actual size of request map for discard was
determined based on original request size, not step request
size, resulting in waste of memory and allocations > 4K.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>
512K is the maximum request size for which request map
fits into one page (4K) regardless of cacheline size.
Signed-off-by: Adam Rutkowski <adam.j.rutkowski@intel.com>